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Abstract 

This document is a pedagogical introduction to statistics for particle physics. 
Emphasis is placed on the terminology, concepts, and methods being used at 
the Large Hadron Collider. The document addresses both the statistical tests 
applied to a model of the data and the modeling itself . I expect to release 
updated versions of this document in the future. 



Contents 


1 Introduction. [3] 

2 Conceptual building blocks for modeling. [3] 

2.1 Probability densities and the likelihood function. [3] 

2.2 Auxiliary measurements . |5] 

2.3 Frequentist and Bayesian reasoning. |6] 

2.4 Consistent Bayesian and Frequentist modeling of constraint terms . |7] 

3 Physics questions formulated in statistical language. [8] 

3.1 Measurement as parameter estimation. [8] 

3.2 Discovery as hypothesis tests. |9] 

3.3 Excluded and allowed regions as confidence intervals. [TT] 

4 Modeling and the Scientific Narrative. [14] 

4.1 Simulation Narrative . M 

4.2 Data-Driven Narrative. [25] 

4.3 Effective Model Narrative. ]27] 

4.4 The Matrix Element Method. [27] 

4.5 Event-by-event resolution, conditional modeling, and Punzi factors. ]28] 

5 Erequentist Statistical Procedures. [28] 

5.1 The test statistics and estimators of/X and 0. ]29] 

5.2 The distribution of the test statistic and p-values. . M 

5.3 Expected sensitivity and bands. [32 

5.4 Ensemble of pseudo-experiments generated with “Toy” Monte Carlo. [33] 

5.5 Asymptotic Eormulas. [33] 

5.6 Importance Sampling. [36] 

5.7 Eook-elsewhere effect, trials factor, Bonferoni. [37] 

5.8 One-sided intervals, CEs, power-constraints, and Negatively Biased Relevant Subsets . . [38] 

6 Bayesian Procedures. [39] 

6.1 Hybrid Bayesian-Erequentist methods. [39] 

6.2 Markov Chain Monte Carlo and the Metropolis-Hastings Algorithm. ]40] 

6.3 Jeffreys’s and Reference Prior. ]40] 

6.4 Eikelihood Principle. ]4T] 

7 Unfolding. ]42 

8 Conclusions. ]42 


2 

































1 Introduction 


It is often said that the language of science is mathematics. It could well be said that the language of 
experimental science is statistics. It is through statistical concepts that we quantify the correspondence 
between theoretical predictions and experimental observations. While the statistical analysis of the data 
is often treated as a final subsidiary step to an experimental physics result, a more direct approach would 
be quite the opposite. In fact, thinking through the requirements for a robust statistical statement is an 
excellent way to organize an analysis strategy. 

In these lecture note^ I will devote significant attention to the strategies used in high-energy 
physics for developing a statistical model of the data. This modeling stage is where you inject your 
understanding of the physics. I like to think of the modeling stage in terms of a conversation. When 
your colleague asks you over lunch to explain your analysis, you tell a story. It is a story about the signal 
and the backgrounds - are they estimated using Monte Carlo simulations, a side-band, or some data- 
driven technique? Is the analysis based on counting events or do you use some discriminating variable, 
like an invariant mass or perhaps the output of a multivariate discriminant? What are the dominant 
uncertainties in the rate of signal and background events and how do you estimate them? What are the 
dominant uncertainties in the shape of the distributions and how do you estimate them? The answer to 
these questions forms a scientific narrative', the more convincing this narrative is the more convincing 
your analysis strategy is. The statistical model is the mathematical representation of this narrative and 
you should strive for it to be as faithful a representation as possible. 

Once you have constructed a statistical model of the data, the actual statistical procedures should 
be relatively straight forward. In particular, the statistical tests can be written for a generic statistical 
model without knowledge of the physics behind the model. The goal of the RooStats project was 
precisely to provide statistical tools based on an arbitrary statistical model implemented with the RooFit 
modeling language. While the formalism for the statistical procedures can be somewhat involved, the 
logical justification for the procedures is based on a number of abstract properties for the statistical 
procedures. One can follow the logical argument without worrying about the detailed mathematical 
proofs that the procedures have the required properties. Within the last five years there has been a 
significant advance in the field’s understanding of certain statistical procedures, which has led to to some 
commonalities in the statistical recommendations by the major LHC experiments. I will review some of 
the most common statistical procedures and their logical justification. 

2 Conceptual building blocks for modeling 
2.1 Probability densities and the likelihood function 

This section specifies my notations and conventions, which I have chosen with some carej^ Our statistical 
claims will be based on the outcome of an experiment. When discussing frequentist probabilities, one 
must consider ensembles of experiments, which may either be real, based on computer simulations, or 
mathematical abstraction. 

Figure [T]establishes a hierarchy that is fairly general for the context of high-energy physics. Imag¬ 
ine the search for the Higgs boson, in which the search is composed of several “channels” indexed by c. 
Here a channel is defined by its associated event selection criteria, not an underlying physical process. 
In addition to the number of selected events, ric, each channel may make use of some other measured 
quantity, Xc, such as the invariant mass of the candidate Higgs boson. The quantities will be called “ob¬ 
servables” and will be written in roman letters e.g. Xc- The notation is chosen to make manifest that the 
observable x is frequentist in nature. Replication of the experiment many times will result in different 

'These notes borrow significantly from other documents that I am writing contemporaneously; specifically Ref. Q], docu¬ 
mentation for HistFactory ^ and the ATLAS Higgs combination. 

^As in the case of relativity, notational conventions can make some properties of expressions manifest and help identify 
mistakes. For example, is manifestly Lorentz invariant and -\- yv is manifestly wrong. 
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values of x and this ensemble gives rise to a probability density function (pdf) of x, written f{x), which 
has the important property that it is normalized to unity 

j f{x) dx =1 . 

In the case of discrete quantities, such as the number of events satisfying some event selection, the 
integral is replaced by a sum. Often one considers a parametric family of pdfs 

f{x\a) , 

read “/ of x given a” and, henceforth, referred to as a probability model or just model. The parameters 
of the model typically represent parameters of a physical theory or an unknown property of the detector’s 
response. The parameters are not frequentist in nature, thus any probability statement associated with a 
is Bayesian]^ In order to make their lack of frequentist interpretation manifest, model parameters will be 
written in greek letters, e.g.: p,, 9, a, i/j^From the full set of parameters, one is typically only interested 
in a few: the parameters of interest. The remaining parameters are referred to as nuisance parameters, 
as we must account for them even though we are not interested in them directly. 

While f{x) describes the probability density for the observable x for a single event, we also need 
to describe the probability density for a dataset with many events, V = {xi, ..., x„}. If we consider the 
events as independently drawn from the same underlying distribution, then clearly the probability density 
is just a product of densities for each event. However, if we have a prediction that the total number of 
events expected, call it a, then we should also include the overall Poisson probability for observing n 
events given v expected. Thus, we arrive at what statisticians call a marked Poisson model, 

n 

i{V\v,a) = Vo\s{n\v)'^f{xe\a), (1) 

e=l 

where I use a bold f to distinguish it from the individual event probability density /(x). In prac¬ 
tice, the expectation is often parametrized as well and some parameters simultaneously modify the ex¬ 
pected rate and shape, thus we can write v —)■ a (a). In RooFit both / and f are implemented with 
a RooAbsPdf; where RooAbsPdf: :getVal(x) always provides the value of f{x) and depending on 
RooAbsPdf : : extendMode () the value of a is accessed via RooAbsPdf : : expectedEvents (). 

The likelihood function L{a) is numerically equivalent to f{x\a) with x fixed - or ^{T>\a) with 
V fixed. The likelihood function should nof be inferprefed as a probabilify densify for a. In particular, 
fhe likelihood funclion does nof have fhe properfy fhaf if normalizes fo unify 

Not True! 

If is common fo work wifh fhe log-likelihood (or negafive log-likelihood) funclion. In fhe case of a 
marked Poisson, we have whaf is commonly referred fo as an extended likelihood Q 

n 

— In L(q;) = z/(q;) — n In z/(a) — In/(xe)-h . 

extended term constant 

To reilerale fhe lerminology, probability density function refers fo fhe value of / as a funclion of x given 
a fixed value of a; likelihood function refers fo fhe value of / as a funclion of a given a fixed value of x; 
and model refers fo fhe full struclure of /(x|a;). 

^Note, one can define a conditional distribution f{x\y) when the joint distribution f{x, y) is defined in a frequentist sense, 
"'while it is common to write s and b for the number of expected signal and background, these are parameters not observ¬ 
ables, so I will write us and ub- This is one of few notational differences to Ref. Q. 
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Probability models can be constructed to simultaneously describe several channels, that is several 
disjoint regions of the data defined by the associated selection criteria. I will use e as the index over 
events and c as the index over channels. Thus, the number of events in the channel is ric and the 
value of the event in the channel is Xce- In this context, the data is a collection of smaller datasets: 

^sim = {^1) • • ■ ) ^Cmax} = {{xc=l,e=l • • • Xc=l,e=nc}) • • • {^c=Cniax,e=l • • • ^c=Cmax,e=nc^^^}}■ In RooFit 
the index c is referred to as a RooCategory and it is used to inside the dataset to differentiate events as¬ 
sociated to different channels or categories. The class RooSimultaneous associates the dataset Vc with 
the corresponding marked Poisson model. The key point here is that there are now multiple Poisson 
terms. Thus we can write the combined (or simultaneous) model 




a = 


n 


cGchannels . 


Pois(nc|i^(a)) n/( ^ce\^) 


e=l 


( 2 ) 


remembering that the symbol product over channels has implications for the structure of the dataset. 



Fig. 1: A schematic diagram of the logical stmcture of a typical particle physics probability model and dataset 
structures. 


2.2 Auxiliary measurements 

Auxiliary measurements or control regions can be used to estimate or reduce the effect of systematic 
uncertainties. The signal region and control region are not fundamentally different. In the language that 
we are using here, they are just two different channels. 

A common example is a simple counting experiment with an uncertain background. In the fre- 
quentist way of thinking, the true, unknown background in the signal region is a nuisance parameter, 
which I will denote If we call the true, unknown signal rate vs and the number of events in the 
signal region nsR then we can write the model Pois(nsR|i^s + vb)- As long as vb is a free parameter, 

^Note, you can think of a counting experiment in the context of Eq.j^with f{x) = 1, thus it reduces to just the Poisson 
term. 
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there is no ability to make any useful inference about us- Often we have some estimate for the back¬ 
ground, which may have come from some control sample with ncR events. If the control sample has no 
signal contamination and is populated by the same background processes as the signal region, then we 
can write Pois(ncR|Ti^B), where ncR is the number of events in the control region and r is a factor used 
to extrapolate the background from the signal region to the control region. Thus the total probability 
model can be written fsim(rasR, ncnli^s, ^b) = Pois(nsR |^'5 + i^b) ■ Pois(ncR|Tz/s). This is a special 
case of Eq.j^and is often referred to as the “on/off’ problem 0. 


Based on the control region alone, one would estimate (or ‘measure’) vb = bcr/t. Intuitively the 
estimate comes with an ‘uncertainty’ of We will make these points more precise in Sec. 3.1 but 

the important lesson here is that we can use auxiliary measurements (ie. tt-cr) to describe our uncertainty 
on the nuisance parameter vb statistically. Furthermore, we have formed a statistical model that can be 
treated in a frequentist formalism - meaning that if we repeat the experiment many times ncR will vary 
and so will the estimate of vb- It is common to say that auxiliary measurements ‘constrain’ the nuisance 
parameters. In principle the auxiliary measurements can be every bit as complex as the main signal 
region, and there is no formal distinction between the various channels. 


The use of auxiliary measurements is not restricted to estimating rates as in the case of the on/off 
problem above. One can also use auxiliary measurements to constrain other parameters of the model. 
To do so, one must relate the effect of some common parameter Up in multiple channels (ie. the signal 
region and a control regions). This is implicit in Eq.[^ 


2.3 Frequentist and Bayesian reasoning 

The intuitive interpretation of measurement of vb to be ttcr/t ± y^ncR/r is that the parameter vb has 
a distribution centered around ticr/t with a width of y/ncR/r. With some practice you will be able 
to immediately identify this type of reasoning as Bayesian. It is manifestly Bayesian because we are 
referring to the probability distribution of a parameter. The frequentist notion of probability of an event 
is defined as the limit of its relative frequency in a large number of trials. The large number of trials 
is referred to as an ensemble. In particle physics the ensemble is formed conceptually by repeating the 
experiment many times. The true values of the parameters, on the other hand, are states of nature, not the 
outcome of an experiment. The true mass of the Z boson has no frequentist probability distribution. The 
existence or non-existence of the Higgs boson has no frequentist probability associated with it. There is 
a sense in which one can talk about the probability of parameters, which follows from Bayes’s theorem: 

nm - . < 3 ) 

Bayes’s theorem is a theorem, so there’s no debating it. It is not the case that Frequentists dispute whether 
Bayes’s theorem is true. The debate is whether the necessary probabilities exist in the first place. If one 
can define fhe joinf probabilify P{A, B) in a frequenfisf way, fhen a Frequenfisf is perfecfly happy using 
Bayes fheorem. Thus, fhe debafe sfarfs af fhe very definifion of probabilify. 

The Bayesian definifion of probabilify clearly can’f be based on relafive frequency. Insfead, if 
is based on a degree of belief. Formally, fhe probabilify needs fo safisfy Kolmogorov’s axioms for 
probabilify, which bofh fhe frequenfisf and Bayesian definifions of probabilify do. One can quanfify 
degree of belief fhrough belling odds, Ihus Bayesian probabilifies can be assigned fo hypofheses on 
stales of nalure. In praclice human’s bels are nol generally nol ‘coherenl’ (see ‘dulch book’), Ihus Ihis 
way of quantifying probabilifies may nol safisfy fhe Kolmogorov axioms. 

Moving pasf fhe philosophy and accepting fhe Bayesian procedure al face value, fhe pracfical 
consequence is lhal one musl supply prior probabilifies for various paramefer values and/or hypofheses. 
In particular, fo inlerprel our example measurement of ncR as implying a probability distribution for vb 
we would write 

vr(^'R|ncR) oc f{ncR\vB)'ri{fB) , (4) 
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where 7r(z/B|ncR) is called the posterior probability density, /(ncRlz^s) is the likelihood function, and 
r]{vB) is the prior probability. Here I have suppressed the somewhat curious term H(ncR), which can 
be thought of as a normalization constant and is also referred to as the evidence. The main point here is 
that one can only invert ‘the probability of ncR given vb" to be ‘the probability of ub given ucr’ if one 
supplies a prior. Humans are very susceptible to performing this logical inversion accidentally, typically 
with a uniform prior on vb- Furthermore, the prior degree of belief cannot be derived in an objective 
way. There are several formal rules for providing a prior based on formal rules (see Jefferey’s prior and 
Reference priors), though these are not accurately described as representing a degree of belief. Thus, 
that style of Bayesian analysis is often referred to as objective Bayesian analysis. 

Some useful and amusing quotes on Bayesian and Frequentist reasoning: 

“Using Bayes’s theorem doesn ’t make you a Bayesian, always using Bayes’s theorem makes 
you a Bayesian.” -unknown 

“Bayesians address the questions everyone is interested in by using assumptions that no 
one believes. Frequentist use impeccable logic to deal with an issue that is of no interest to 
anyone.”- Louis Lyons 


2.4 Consistent Bayesian and Frequentist modeling of constraint terms 


Often a detailed probability model for an auxiliary measurement are not included directly into the model. 
If the model for the auxiliary measurement were available, it could and should be included as an addi¬ 
tional channel as described in Sec. |2.2| The more common situation for background and systematic 


uncertainties only has an estimate, “central value”, or best guess for a parameter ap and some notion 
of uncertainty on this estimate. In this case one typically resorts to including idealized terms into the 
likelihood function, here referred to as “constraint terms”, as surrogates for a more detailed model of the 
auxiliary measurement. I will denote this estimate for the parameters as Op, to make it manifestly fre¬ 
quentist in nature. In this case there is a single measurement of Op per experiment, thus it is referred to as 
a “global observable” in RooStats. The treatment of constraint terms is somewhat ad hoc and discussed 
in more detail in Section 4.1.6 I make it a point to write constraint terms in a manifestly frequentist form 

/ (®p|c^p)- 


Probabilities on parameters are legitimate constructs in a Bayesian setting, though they will always 
rely on a prior. In order to distinguish Bayesian pdfs from frequentist ones, greek letters will be used for 
their distributions. For instance, a generic Bayesian pdf might be written 7r(Q;). In the context of a main 
measurement, one might have a prior for ap based on some estimate Op. In this case, the prior 7r(ap) 
is really a posterior from some previous measurement. It is desirable to write with the help of Bayes 
theorem 

7r{ap\ap) oc L{ap)r]{ap) = f{ap\ap)r]{ap) , (5) 


where ri{ap) is some more fundamental priorj^ By taking the time to undo the Bayesian reasoning into 
an objective pdf or likelihood and a prior we are able to write a model that can be used in a frequentist 
context. Within RooStats, the care is taken to separately track the frequentist component and the prior; 
this is achieved with the ModelConf ig class. 


If one can identify what auxiliary measurements were performed to provide the estimate of ap and 
its uncertainty, then it is not a logical fallacy to approximate it with a constraint term, it is simply a con¬ 
venience. However, not all uncertainties that we deal result from auxiliary measurements. In particular, 
some theoretical uncertainties are not statistical in nature. For example, uncertainty associated with the 


®Glen Cowan has referred to this more fundamental prior as an ’urprior’, which is based on the German use of ’ur’ for 
forming words with the sense of ‘proto-, primitive, original’. 
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choice of renormalization and factorization scales and missing higher-order corrections in a theoretical 
calculation are not statistical. Uncertainties from parton density functions are a bit of a hybrid as they are 
derived from data but require theoretical inputs and make various modeling assumptions. In a Bayesian 
setting there is no problem with including a prior on the parameters associated to theoretical uncertain¬ 
ties. In contrast, in a formal frequentist setting, one should not include constraint terms on theoretical 
uncertainties that lack a frequentist interpretation. That leads to a very cumbersome presentation of re¬ 
sults, since formally the results should be shown as a function of the uncertain parameter. In practice, 
the groups often read Eq.|^to arrive at an effective frequentist constraint term. 

I will denote the set of parameters with constraint terms as S and the global observables Q = {op} 
with p G §. By including the constraint terms explicitly (instead of implicitly as an additional channel) 
we arrive at the total probability model, which we will not need to generalize any further: 


ftot(Aira,^|a) 


n 


cGchannels . 


Pois(nc|r'c(a)) fd 


X r.p. Q!) 


e=l 


■ fpi^pWp) ■ 

peS 


( 6 ) 


3 Physics questions formulated in statistical language 
3.1 Measurement as parameter estimation 

One of the most common tasks of the working physicist is to estimate some model parameter. We do it 
so often, that we often don’t realize it. For instance, the sample mean x = X]e=i is an estimate for 
the mean, /r, of a Gaussian probability density /(x|/i, a) = Gauss(x|p, a). More generally, an estimator 
a{'D) is some function of the data and its value is used to estimate the true value of some parameter a. 
There are various abstract properties such as variance, bias, consistency, efficiency, robustness, etc Q. 
The bias of an estimator is defined as ^(q;) = E[a] — a, where E means the expectation value of 
E[a] = f d(x)f(x)dx or the probability-weighted average. Clearly one would like an unbiased estima¬ 
tor. The variance of an estimator is defined as var[d] = E[{a — £'[a])^]; and clearly one would like 
an estimator with the minimum variance. Unfortunately, there is a tradeoff between bias and variance. 
Physicists tend to be allergic to biased estimators, and within the class of unbiased estimators, there is 
a well defined minimum variance bound referred to as the Cramer-Rao bound (that is the inverse of the 
Fisher information, which we will refer to again later). 

The most widely used estimator in physics is the maximum likelihood estimator (MFE). It is 
defined as the value of a which maximizes the likelihood function L{a). Equivalently this value, d, 
maximizes log L{a) and minimizes — log L(a). The most common tool for finding the maximum likeli¬ 
hood estimator is Minuit, which conventionally minimizes — logL(a) (or any other function) ij^. The 
jargon is that one ‘fits’ the function and the maximum likelihood estimate is the ‘best fit value’. 

When one has a multi-parameter likelihood function L{a), then the situation is slightly more 
complicated. The maximum likelihood estimate for the full parameter list, d, is clearly defined. The 
various components dp are referred to as the unconditional maximum likelihood estimates. In the physics 
jargon, one says all the parameters are ‘floating’. One can also ask about maximum likelihood estimate 
of Up is with some other parameters olq fixed; this is called the conditional maximum likelihood estimate 
and is denoted dp{oLo). These are important quantities for defining the profile likelihood ratio, which 
we will discuss in more detail later. The concept of variance of the estimates is also generalized to 
the covariance matrix cov[ap,api] = E[{dp — ap){dpi — a^/)] and is often denoted Spp/. Note, the 
diagonal elements of the covariance matrix are the same as the variance for the individual parameters, ie. 
cov\ap,ap] = t;ar[ap]. 

In the case of a Poisson model Pois(n|z^) the maximum likelihood estimate of v is simply v = n. 
Thus, it follows that the variance of the estimator is var[i>] = var[n] = u. Thus if the true rate is v one 
expects to find estimates D with a characteristic spread around v, it is in this sense that the measurement 
has a estimate has some uncertainty or ‘error’ of yTi. We will make this statement of uncertainty more 




precise when we discuss frequentist confidence intervals. 

When the number of events is large, the distribution of maximum likelihood estimates approaches 
a Gaussian or normal distribution|^ This does not depend on the pdf f{x) having a Gaussian form. For 
small samples this isn’t the case, but this limiting distribution is often referred to as an asymptotic dis¬ 
tribution. Furthermore, under most circumstances in particle physics, the maximum likelihood estimate 
approaches the minimum variance or Cramer-Rao bound. In particular, the inverse of the covariance 
matrix for the estimates is asymptotically given by 


S„;(a) = E 


8“^ log/(x|Q:) 


dapdp! 


ex 


(V) 


where I have written explicitly that the expectation, and thus the covariance matrix itself, depend on the 
true value ex. The right side of Eq. [7] is called the (expected) Fisher information matrix. Remember 
that the expectation involves an integral over the observables. Since that integral is difficult to perform 
in general, one often uses the observed Fisher information matrix to approximate the variance of the 
estimator by simply taking the matrix of second derivatives based on the observed data 


E-2ia) = - 


PP 


8^ log L{cx) 

8ap8p' 


( 8 ) 


This is what Minuit’s Hesse algorithrrj^calculates to estimate the covariance matrix of the parameters. 


3.2 Discovery as hypothesis tests 

Let us examine the statistical statement associated to the claim of discovery for new physics. Typically, 
new physics searches are looking for a signal that is additive on top of the background, though in some 
cases there are interference effects that need to be taken into account and one cannot really talk about 
’signal’ and ’background’ in any meaningful way. Discovery is formulated in terms of a hypothesis 
test where the background-only hypothesis plays the role of the null hypothesis and the signal-plus- 
background hypothesis plays the roll of the alternative. Roughly speaking, the claim of discovery is a 
statement that the data are incompatible with the background-only hypothesis. Consider the simplest 
scenario where one is counting events in the signal region, nsR and expects vb events from background 
and Vs events from the putative signal. Then we have the following hypotheses: 

symbol statistical name physics name probability model 

Hq null hypothesis background-only Pois(n 5 R|z^R) 

Hi alternate hypothesis signal-plus-background Pois(n 5 R|z^ 5 -|-i/r) 

In this simple example it’s fairly obvious that evidence for a signal shows up as an excess of events and 
a reasonable way to quantify the compatibility of the observed data n^cR hypothesis is to 

calculate the probability that the background-only would produce at least this many events; the p-value 

OO 

P = E Pois(n|i^R) . (9) 

n=n%j^ 

If this p-value is very small, then one might choose to reject the null hypothesis. 

Note, the p-value is not a to be interpreted as the probability of the null hypothesis given the data - 
that is a manifestly Bayesian statement. Instead, the p-value is a statement about the probability to have 
obtained data with a certain property assuming the null hypothesis. 

’There are various conditions that must be met for this to be true, but skip the fine print in these lectures. There are two 
conditions that are most often violated in particle physics, which will be addressed later. 

*The matrix is called the Hessian, hence the name. 
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How do we generalize this to more complicated situations? There were really two ingredients in 
our simple example. The first was the proposal that we would reject the null hypothesis based on the 
probability for it to produce data at least as extreme as the observed data. The second ingredient was 
the prescription for what is meant by more discrepant; in this case the possible observations are ordered 
according to increasing nsR- One could imagine using difference between observed and expected, nsR — 
Ur, as the measure of discrepancy. In general, a function that maps the data to a single real number is 
called a test statistic: T{T>) —)• M. How does one choose from the infinite number of test statistics? 

Neyman and Pearson provided a framework for hypothesis testing that addresses the choice of 
the test statistic. This setup treats the null and the alternate hypotheses in an asymmetric way. First, 
one defines an acceptance region in terms of a test statistic, such that if T{V) < ka one accepts the 
null hypothesis. One can think of the T{V) = ka as defining a contour in the space of the data, which 
is the boundary of this acceptance region. Next, one defines the size of the test, a|^ as the probability 
the null hypothesis will be rejected when it is true (a so-called Type-I error). This is equivalent to 
the probability under the null hypothesis that the data will not be found in this acceptance region, ie. 
a = P{T{V) > ka\Ho)- Note, it is now clear why there is a subscript on ka, since the contour level is 
related to the size of the test. In contrast, if one accepts the null hypothesis when the alternate is true, 
it is called a Type-II error. The probability to commit a Type-II error is denoted as /3 and it is given by 
P = P{T{V) < ka\Hi). One calls 1 — /3 the power of the test. With these definitions in place, one 
looks for a test statistic that maximizes the power of the test for a fixed test size. This is a problem for 
the calculus of variations, and sounds like it might be very difficult for complicated probability models. 

It turns out that in the case of two simple hypotheses (probability models without any parameters), 
there is a simple solution! In particular, the test statistic leading to the most powerful test is given by the 
likelihood ratio Tatp (2?) = f(2?|22i)/f(2?|22o)- This result is referred to as the Neyman-Pearson lemma, 
and I will give an informal proof. We will prove this by considering a small variation to the acceptance 
region defined by the likelihood ratio. The solid red contour in Fig. [^represents the rejection region 
(the complement to the acceptance region) based on the likelihood ratio and the dashed blue contour 
represents a small perturbation. If we can say that any variation to the likelihood ratio has less power, 
then we will have proved the Neyman-Pearson lemma. The variation adds (the left, blue wedge) and 
removes (the right, red wedge) rejection regions. Because the Neyman-Pearson setup requires that both 
tests have the same size, we know that the probability for the data to be found in the two wedges must be 
the same under the null hypothesis. Because the two regions are on opposite sides of the contour defined 
by f(2?|F2i)/f (2?|iFo), then we know that the data is less likely to be found in the small region that we 
added than the small region we subtracted assuming the alternate hypothesis. In other words, there is 
less probability to reject the null when the alternate is true; thus the test based on the new contour is less 
powerful. 

How does this generalize for our most general model in Eq. [^with many free parameters? First 
one must still define the null and the alternate hypotheses. Typically is done by saying some parameters 
- the parameters of interest ccpoi - take on specific values takes on a particular value for the signal- 
plus-background hypothesis and a different value for the background-only hypothesis. For instance, 
the signal production cross-section might be singled out as the parameter of interest and it would take 
on the value of zero for the background-only and some reference value for the signal-plus-background. 
The remainder of the parameters are called the nuisance parameters Qnuis- Unfortunately, there is no 
equivalent to the Neyman-Pearson lemma for models with several free parameters - so called, composite 
models. Nevertheless, there is a natural generalization based on the profile likelihood ratio. 

Remembering that the test statistic T is a real-valued function of the data, then any particular 
probability model ftot(2^|Q!) implies a distribution for the test statistic f{T\a). Note, the distribution for 
the test statistic depends on the value of a. Below we will discuss how one constructs this distribution, 
but lets take it as given for the time being. Once one has the distribution, then one can calculate the 

®Note, a is the conventional notation for the size of the test, and has nothing to do with a model parameter in Eq. 2 
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P{x\Hi) 

P{x\Ho) 


< k, 



P{ kJHo) = P{^\Ho) 



> ka 


P( KJH^)<P(y\H^) 


Fig. 2: A graphical proof of the Neyman-Pearson lemma. 


p-value is given by 

roc p 

p(a)= / fiT\cx)dT= iiV\cx)e{T(p)- To) dV = P{T>To\cx) , (10) 

JTo J 

where Tq is the value of the test statistic based on the observed data and 0(-) is the Heaviside functionj^ 
Usually the p-value is just written as p, but I have written it as p{oc) to make its cr-dependence explicit. 

Given that the p-value depends on cr, how does one decide to accept or reject the null hypothesis? 
Remembering that crpoi takes on a specific value for the null hypothesis, we are worried about how the 
p-value changes as a function of the nuisance parameters. It is natural to say that one should not reject the 
null hypothesis if the p-value is larger than the size of the test/or any value of the nuisance parameters. 
Thus, in a frequentist approach one should either present p-value explicitly as a function of Qnuis or take 
its maximal (or supremum) value 

Psup(Q:poi) — sup p(Q!nuis) • (11) 

^nuis 

As a final note it is worth mentioning that the size of the test, which serves as the threshold for 
rejecting the null hypothesis, is purely conventional. In most sciences conventional choices of the size 
are 10%, 5%, or 1%. In particle physics, our conventional threshold for discovery is the infamous 5cr 
criterion - which is a conventional way to refer to a = 2.87 • 10“^. This is an incredibly small rate of 
Type-I error, reflecting that claiming the discovery of new physics would be a monumental statement. 
The origin of the Scr criterion has its roots in the fact that traditionally we lacked the tools to properly 
incorporate systematics, we fear that there are systematics that may not be fully under control, and we 
perform many searches for new physics and thus we have many chances to reject the background-only 
hypothesis. We will return to this in the discussion of the look-elsewhere effect. 

3.3 Excluded and allowed regions as confidence intervals 

Often we consider a new physics model that is parametrized by theoretical parameters. For instance, the 
mass or coupling of a new particle. In that case we typically want to ask what values of these theoretical 

''’The integral J dV is a bit unusual for a marked Poisson model, because it involves both a sum over the number of events 
and an integral over the values of Xe for each of those events. 
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parameters are allowed or excluded given available data. Figure shows two examples. Figure |^a) 
shows an example with ccpoi = {cr/asM, ^h), where ajasM is the ratio of the production cross-section 
for the Higgs boson with respect to its prediction in the standard model and Mh is the unknown Higgs 
mass parameter in the standard model. All the parameter points above the solid black curve correspond 
to scenarios for the Higgs boson that are considered ‘excluded at the 95% confidence level’. Figure [^b) 
shows an example with Qpoi = {my/.,mt) where mw is the mass of the VF-boson and mt is the mass 
of the top quark. We have discovered the W -boson and the top quark and measured their masses. The 
blue ellipse ‘is the 68% confidence level confour’ and all fhe parameter poinfs inside if are considered 
‘consisfenf wifh dafa af fhe la level’. Whaf is fhe precise meaning of fhese sfafemenfs? 




Mh [GeV] 


(a) (b) 

Fig. 3: Two examples of confidence intervals. 


In a frequentist setting, these allowed regions are called confidence intervals or confidence regions, 
and the parameter points outside them are considered excluded. Associated with a confidence inferval 
is a confidence level, i.e. fhe 95% and 68% confidence level in fhe fwo examples. If we repeal fhe 
experimenfs and oblain differenf dafa, fhen fhese confidence infervals will change. If is useful fo fhink of 
fhe confidence infervals as being random in fhe same way fhe dafa are random. The defining properly of 
a 95% confidence inferval is lhal if covers fhe Irue value 95% of fhe lime. 

How can one possibly consfrucl a confidence inferval has fhe desired properly, fhaf if covers fhe 
frue value wifh a specified probabilify, given fhaf we don’f know fhe Irue value? The procedure for 
building confidence intervals is called the Neyman Construction 0, and it is based on ‘inverting’ a 
series of hypothesis tests (as described in Sec. |3.2| ). In particular, for each value of a in the parameter 
space one performs a hypothesis test based on some test statistic where the null hypothesis is a. Note, 
that in this context, the null hypothesis is changing for each test and generally is not the background- 
only. If one wants a 95% confidence interval, fhen one conslrucls a series of hypolhesis fesl wifh a size 
of 5%. The confidence interval I{'D) is consfrucled by faking fhe sel of paramefer poinfs where fhe null 
hypolhesis is accepted. 

I{V) = {a\P{T{V)> kc,\a) <a} , (12) 

where fhe final a and fhe subscripl ka refer lo fhe size of fhe fesl. Since a hypolhesis fesl wifh a size 
of 5% should accepl fhe null hypolhesis 95% of fhe lime if if is Irue, confidence infervals constructed in 
this way satisfy the defining properly. This same properly is usually formulaled in terms of coverage. 
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Coverage is the probability that the interval will contain (cover) the parameter a when it is true, 

coverage(Q:) = P{a G 11 a) . (13) 

The equation above can easily be mis-interpreted as the probability the parameter is in a fixed interval 
/; but one must remember that in evaluating the probability above the data V, and, thus, the corre¬ 
sponding intervals produced by the procedure /(T>), are the random quantities. Note, that coverage is a 
property that can be quantified for any procedure fhaf produces fhe confidence infervals I. Infervals pro¬ 
duced using the Neyman Construction procedure are said to “cover by construction”; however, one can 
consider alternative procedures that may either under-cover or over-cover. Undercoverage means that 
P{a G / I q) is smaller than desired and over-coverage means that P{a G / | a) is larger than desired. 
Note that in general coverage depends on the assumed true value a. 

Since one typically is only interested in forming confidence infervals on fhe parameters of inferesf, 
fhen one could use fhe supremum p-value of Eq. [m This procedure ensures fhaf fhe coverage is af teas! 
fhe desired level, fhough for some values of a if may over-cover (perhaps significanlly). This procedure, 
which I call fhe ‘full consfrucfion’, is also compufafionally very intensive when cx has many parameters 
as if require performing many hypofhesis fesfs. In fhe naive approach where each ap is scanned in a 
regular grid, fhe number of paramefer poinfs fesfed grows exponenfially in fhe number of paramefers. 
There is an alfernafive approach, which I call fhe ‘profile consfrucfion’ l|^[^ and which sfafisficians call 
an ‘hybrid resampling fechnique’ | [T0l[TT| fhaf is approximate fo fhe full consfrucfion, buf fypically has 
good coverage properfies. We refum fo fhe procedures and properfies for fhe differenf fypes of Neyman 
Consfrucfions lafer. 





Fig. 4: A schematic visualization of the Neyman Construction. For each value of 9 one finds a region in x 
that satisfies f f(xj9)dx (blue). Together these regions form a confidence belt (green). The intersection of the 
observation xq (red) with the confidence belt defines the confidence interval [0i, 021 - 


Figure [^provides an overview of the classic Neyman construction corresponding to the left panel 
of Fig. The left panel of Fig.j^is taken from the Feldman and Cousins’s paper 1121 where the parameter 
of the model is denoted /r instead of 0. For each value of the parameter /r, the acceptance region in x 
is illustrated as a horizontal bar. Those regions are the ones that satisfy T(P) < ka, and in the case of 
Feldman-Cousins the test statistic is the one of Eq. This presentation of the confidence belf works 
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well for a simple model in which the data consists of a single measurement T> = {a;}- Once one has the 
confidence belt, then one can immediately find the confidence interval for a particular measurement of x 
simply by taking drawing a vertical line for the measured value of x and finding the intersection with the 
confidence belt. 

Unfortunately, this convenient visualization doesn’t generalize to complicated models with many 
channels or even a single channel marked Poisson model where V = {xi,... ,Xn}- In those more 
complicated cases, the confidence belt can still be visualized where the observable x is replaced with T, 
the test statistic itself. Thus, the boundary of the belt is given by ka vs. /i as in the right panel of Figure]^ 
The analog to the vertical line in the left panel is now a curve showing how the observed value of the test 
statistic depends on /r. The confidence interval still corresponds to the intersection of the observed test 
statistic curve and the confidence belt, which clearly satisfies T{T>) < ka- For more complicated models 
with many parameters the confidence belt will have one axis for the test statistic and one axis for each 
model parameter. 
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Fig. 5 : Two presentations of a confidence belt (see text). Left panel taken from Ref. GD- Right panel shows a 
presentation that generalizes to more complicated models. 




- * 

. ^ 


tfj, = -2\nX{fi) 


Pm = 


/ OO 

/i,obs 


dt,, 


-log >.(Lt) 


Note, a 95% confidence interval does not mean that there is a 95% chance that the true value of the 
parameter is inside the interval - that is a manifestly Bayesian statement. One can produce a Bayesian 
credible interval with that interpretation; however, that requires a prior probability distribution over the 
parameters. Similarly, for any fixed interval I one can compute the Bayesian credibility of the interval 


P{a E I\V) 


fj i{T>\a)7r{a)da 
f f(D|Q:)7r(Q;)dQ; 


(14) 


4 Modeling and the Scientific Narrative 

Now that we have established a general form for a probability model (Eq. and we have translated 
the basic questions of measurement, discovery, and exclusion into the statistical language we are ready 
to address the heart of the statistical challenge - building the model. It is difficult to overestimate how 
important the model building stage is. So many of the questions that are addressed to the statistical 
experts in the major particle physics collaborations are not really about statistics per se, but about model 
building. In fact, the first question that you are likely to be asked by one of the statistical experts is “what 
is your model?” 
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Often people are confused by the question “what is your model?” or simply have not written it 
down. You simply can’t make much progress on any statistical questions if you haven’t written down a 
model. Of course, people do usually have some idea for what it is that they want to do The process of 
writing down the model often obviates the answer to the question, reveals some fundamental confusion 
or assumption in the analysis strategy, or both. As mentioned in the introduction, writing down the model 
is intimately related with the analysis strategy and it is a good way to organize an analysis effort. 

I like to think of the modeling stage in terms of a scientific narrative. I find that there are three 
main narrative elements, though many analyses use a mixture of these elements when building the model. 
Below I will discuss these narrative elements, how they are translated into a mathematical formulation, 
and their relative pros and cons. 

4.1 Simulation Narrative 

The simulation narrative is probably the easiest to explain and produces statistical models with the 
strongest logical connection to physical theory being tested. We begin with an relation that every particle 
physicists should know for the rate of events expected from a specific physical process 

rale = (flux) x (cross section) x (efficiency) x (accepfance) , (15) 

where the cross section is predicted from the theory, the flux is controlled by the acceleratoi[^ and the 
efficiency and acceptance are properties of the detector and event selection criteria. It is worth not¬ 
ing that the equation above is actually a repackaging of a more fundamental relationship. In fact the 
fundamental quantity that is predicted from first principles in quantum theory is the scattering proba¬ 
bility P{i —)>/) = |(f|/)P/((fK)(/|/)) inside a box of size V over some time interval T, which is then 
repackaged into the Lorentz invariant form above [?]. 

In the simulation narrative the efficiency and acceptance are estimated with computer simulations 
of the detector. Typically, a large sample of events is generated using Monte Carlo techniques [?]. The 
Monte Carlo sampling is performed separately for the hard (perturbative) interaction (e.g. MadGraph), 
the parton shower and hadronization process (e.g. Pythia and Herwig), and the interaction of particles 
with the detector (e.g. Geant). Note, the efficiency and accepfance depend on the physical process 
considered, and I will refer to each such process as a sample (in reference to the corresponding sample 
of events generated with Monte Carlo techniques). 

To simplify the notation, I will define fhe effecfive cross section, cjefr, to be the product of the total 
cross section, efficiency, and acceptance. Thus, the total number of events expected to be selected for 
a given scattering process, u, is the product of the time-integrated flux or time-integrated luminosity. A, 
and the effective cross section 

n = AcJeff. . (16) 

I use A here instead of the more common L to avoid confusion with the likelihood function and because 
when we incorporate uncertainty on the time-integrated luminosity it will be a parameter of the model 
for which I have chosen to use greek letters. 

If we did not need to worry about detector effects and we could measure the final sfafe perfecfly, 
then the distribution for any observable x would be given by 

(idealized) f{x) = . ( 17 ) 

CJeff. dx 

Of course, we do need to worry about detector effects and we incorporate them with the detector sim¬ 
ulation discussed above. From the Monte Carlo sample of event^jxi , ... ,xn} we can estimate the 

"in some cases, like cosmic rays, the flux must be estimated since the accelerator is quite far away. 

"Here I only consider unweighted Monte Carlo samples, but the discussion below can be generalized for weighted Monte 
Carlo samples. 
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underlying distribution f{x) simply by creating a histogram. If we want we can write the histogram 
based on B bins centered at X}, with bin width wi, explicitly as 

N B 

ru- , ^ \ Ut \ Xb\/Wb)6{\x - Xh\/Wh) 

(histogram) / x) h{x) = > , (18) 

(V Wb 

i=l b=l 

where the first Heaviside function accumulates simulated events in the bin and the second selects the bin 
containing the value of x in question. Histograms are the most common way to estimate a probability 
density function based on a finite sample, but there are other possibilities. The downsides of histograms 
as an estimate for the distribution f{x) is that they are discontinuous and have dependence on the location 
of the bin boundaries. A particularly nice alternative is called kernel estimation | |T3| . In this approach, 
one places a kernel of probability K{x) centered around each event in the sample: 

N / _ A 

(kernel estimate) f{x) - fo{x) = ^ 

The most common choice of the kernel is a Gaussian distribution, and there are results for the optimal 
width of the kernel h. Equation[^is referred to as the fixed kernel esfimafe since h is common for all fhe 
evenfs in fhe sample. A second order esfimafe or adaptive kernel estimation provides heller performance 
when fhe disfribulion is multimodal or has bolh narrow and wide fealures | [T3l |. 

4.1.1 The multi-sample mixture model 

So far we have only considered a single inferacfion process, or sample. How do we form a model 
when Ihere are several scattering processes confribuling lo fhe folal rale and disfribulion of x? From 
firsl principles of quanlum mechanics we musl add Ihese differenl processes logelher. Since Ihere is no 
physical meaning fo label individual processes lhal inlerfere quanlum mechanically, I will consider all 
such processes as a single sample. Thus fhe remaining sel of samples fhal do nof inlerfere simply add 
incoherenfly. The folal rale is simply fhe sum of fhe individual rales 

t'tot = X] 

sGsamples 

and fhe folal disfribulion is a weighled sum called a mixture model 

fix) = — ’ (21) 

.esamples 

where fhe subscripl s has been added fo fhe equafions above for each such sample. Wilh Ihese Iwo 
ingredienfs we can conslrucl our marked Poisson model of Eq.[T]for a single channel, and we can simply 
repeal Ihis for several disjoinl even! selection requiremenfs fo form a multi-channel simulfaneous model 
like Eq. In fhe multi-channel case we will give fhe additional subscripl c G channels lo Vcs, fcsix), 
Uc^tot, and fcix). However, al Ihis poinl, our model has no free parameters a. 

4.1.2 Incorporating physics parameters into the model 

Now we wanl lo paramelrize our model inlerns of some physical parameters a, such as Ihose lhal appear 
in Ihe Eagrangian of a some Iheory. Changing fhe parameters in Ihe Eagrangian of a Iheory will in 
general change bolh fhe folal rate u and Ihe shape of fhe dislribulions fix). In principle, we can repeal 
the procedure above for each value of these parameters cx to form VcsioL) and fcsix\a) for each sample 
and selection channel, and, thus, from fsim(T’|Q:)- In practice, we need to resort to some interpolation 
strategy over the individual parameter points {o:*} where we have Monte Carlo samples. We will return 
to these interpolation strategies later. 
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In some case the only effect of the parameter is to scale the rate of some scattering process z^s(q:) 
without changing its distribution fs{x\a). Furthermore, the scaling is often known analytically, for 
instance, a coupling constants produce a linear relationship like I'iap) = + r'o- In such cases, 

interpolation is not necessary and the parametrization of the likelihood function is straightforward. 

Note, not all physics parameters need be considered parameters of interest. There may be a free 
physics parameter that is not directly of interest, and as such it would be considered a nuisance parameter. 

4.1.2.1 An example, the search for the standard model Higgs boson 

In the case of searches for the standard model Higgs boson, the only free parameter in the Lagrangian is 
rriH- Once mn is specified the rates and the shapes for each of the scattering processes (combinations of 
production and decay modes) are specified by fhe fheory. Of course, as fhe Higgs boson mass changes 
fhe disfribufions do change so we do need fo worry abouf inferpolafing fhe shapes f{x\mH). However 
fhe resulfs are offen presenfed as a raster scan over mn, where one fixes mn and fhen asks abouf fhe rafe 
of signal evenfs from fhe Higgs boson scaffering process. Wifh mn fixed fhis is really a simple hypofh- 
esis fesf befween background-only and signal-plus-backgrouncj^ buf we usually choose fo consfrucf a 
paramefrized model fhaf does nol direcfly correspond fo any fheory. In fhis case fhe paramefer of inferesf 
is some scaling of fhe rale wifh respecl fo fhe slandard model prediction, p, = a/asm, such lhal p = Ois 
the background-only situation and p = lis the standard model prediction. Furthermore, we usually use 
this global p factor for each of the production and decay modes even though essentially all theories of 
physics beyond the standard model would modify the rates of the various scattering processes differently. 
Figure shows confidence intervals on p for fixed values of niH- Values below fhe solid black curve 
are nol excluded (since an arbifrarily small signal rafe cannof be differenlialed from fhe background-only 
and fhis is a one-sided confidence interval). 

4.1.3 Incorporating systematic effects 

The parfon shower, hadronizalion, and defeclor simulalion componenfs of fhe simulalion narrative are 
based on phenomenological models lhal have many adjuslable paramelers. These paramefers are nui¬ 
sance paramelers included in our master lisl of parameters a. The changes in fhe rates and shapes 
f{x\a) due fo Ihese paramelers lead fo sysfemafic uncerlainfie^^ We have already eluded fo how one 
can deal wifh fhe presence of nuisance paramelers in hypolhesis fesling and confidence infervals, buf here 
we are focusing on fhe modeling slage. In principle, we deal wifh modeling of fhese nuisance parameters 
in fhe same way as fhe physics paramelers, which is fo generate Monte Carlo samples for several choices 
of fhe paramelers {cti} and fhen use some interpolation slralegy fo form a conlinuous paramelrizalion 
for u{a), f{x\Q.), and fsim(T’|Q;). In practice, Ihere are many nuisance parameters associated fo fhe 
parfon shower, hadronizalion, and defeclor simulation so fhis becomes a mulli-dimensional inlerpolalion 
problerrp^ This is one of fhe mosl severe challenges for fhe simulalion narrafive. 

Typically, we don’f map oul fhe correlated effecl of changing mulfiple ap simullaneously. Inslead, 
we have some nominal sellings for fhese paramelers and fhen vary each individual paramefer ‘up’ 
and ‘down’ by some reasonable amounl a^. So if we have Np parameters we typically have 1 -|- 2Np 
variations of fhe Monte Carlo sample from which we fry fo form fsim(T’|Q:)- This is clearly nol an ideal 
silualion and if is nol hard fo imagine cases where fhe combined effecl on fhe rate and shapes cannof be 
factorized in terms of changes from the individual parameters. 

What is meant by “vary each individual parameter ‘up’ and ‘down’ by some reasonable amount” in 
the paragraph above? The nominal choice of the parameters is usually based on experience, test beam 

'^Note that H -y WW interferes with “background-only” WW scattering process. For low Higgs boson masses, the narrow 
Higgs width means this interference is negligible. However, at high masses the interference effect is significant and we should 
really treat these two processes together as a single sample. 

'‘'Systematic uncertainty is arguably a better term than systematic error. 

'^This is sometimes referred to as ‘template morphing’ 


17 



studies, Monte Carlo ‘tunings’, etc.. These studies correspond to auxiliary measurements in the language 
used in Sec. |2.2| and Sec. |2.4[ Similarly, these parameters typically have some maximum likelihood 


estimates and standard uncertainties from the auxiliary measurements as described in Sec. 3.1 Thus our 
complete model ftotC^lct) of Eq.[^ should not only deal with parametrizing the effect of changing each 
ap but also include either a constraint term fp{ap\ap) or an additional channel that describes a more 
complete probability model for the auxiliary measurement. 

Below we will consider a specific interpolation strategy and a few of the most popular conventions 
for constraint terms. However, before moving on it is worth emphasizing that while, naively, the matrix 
element associated to a perturbative scattering amplitude has no free parameters (beyond the physics 
parameters discussed above), fixed order perturbative calculafions do have residual scale dependence. 
This fype of theoretical uncertainty has no auxiliary measuremenf associated wifh if even in principle, 
thus it really has no frequentist description. This was discussed briefly in Sec. 2.4 In confrasf, fhe parfon 
densify functions are fhe resulfs of auxiliary measuremenfs and fhe groups producing fhe parfon densify 
funcfion sefs spend fime providing sensible multivariate consfrainf terms for fhose paramefers. However, 
those measurements also have uncertainties due to parametrization choices and theoretical uncertainties, 
which are not statistical in nature. In short we must take care in ascribing constraint terms to theoretical 
uncertainties and measurements that have theoretical uncertainties^ 


4.1.4 Tabulating the effect of varying sources of uncertainty 

The treatment of systematic uncertainties is subtle, particularly when one wishes to take into account 
the correlated effect of multiple sources of systematic uncertainty across many signal and background 
samples. The most important conceptual issue is that we separate the source of the uncertainty (for 
instance the uncertainty in the calorimeter’s response to jets) from its effect on an individual signal or 
background sample (eg. the change in the acceptance and shape of a fE+jets background). In particular, 
the same source of uncertainty has a different effect on the various signal and background samples. 
The effect of these ‘up’ and ‘down’ variations about the nominal predictions i's{oP) and fsb{x\cy.^) is 
quantified by dedicated sfudies. The resulf of fhese sfudies can be arranged in fables like fhose below. 
The main purpose of fhe HistFactory XML schema is fo represenf fhese fables. And HistFactory is a 
fool fhaf can convert fhese fables info our master model ftot(^|Q:) of Eq. [^implemenfed as a RooAbsPdf 
wifh a ModelConf ig fo make if compatible wifh RooStats fools. The convention used by HistFactory 
is relafed fo our nofafion via 

Us{a)fs{x\a) = r]s{oi)as{x\a.) ( 22 ) 

where r]s{a) represenfs relative changes in fhe overall rate u{a) and cJs(x|q;) includes bofh changes 
fo fhe rate and fhe shape f{x\a.). This choice is one of convenience because histograms are often nol 
normalized fo unify, buf insfead in code rafe informafion. As fhe name implies, HistFactory works 
wifh histograms, so insfead of writing as{x\cx) fhe fable is written as asb{oL), where 5 is a bin index. 
To compress fhe nofafion furfher, and represenf fhe value of when Up = and all ofher 

paramefers are fixed fo fheir nominal values. Thus we arrive af fhe following fabular form for models 
builf on fhe simulafion narrative based on hisfograms wifh individual nuisance parameters varied one af 
a fime: 


4.1.5 Interpolation Conventions 

Eor each sample, one can inferpolafe and exfrapolafe from fhe nominal prediction = 1 and fhe vari¬ 
ations rj^g fo produce a paramefrized r]s{a). Similarly, one can inferpolafe and exfrapolafe from fhe 
nominal shape and fhe variations fo produce a paramefrized asb{oi)- We choose fo paramefrize 
Up such fhaf Op = 0 is fhe nominal value of fhis parameter, Op = ±1 are fhe “±lcj variafions”. Need¬ 
less fo say, fhere is a significanl amounf of ambiguify in fhese interpolation and exfrapolafion proce- 

'®“Note that I deliberately called them theory errors, not uncertainties.” - Tilman Plehn 
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Syst 

Sample 1 

Sample N 

Nominal Value 

Vs=l = 1 

Vs=N = 1 

p=OverallSys 1 

^p=l,s=l’ ^p=l,s=l 

Vp=l^s=N’ Vp=l^s=N 

p=OverallSys M 

^p=M,s=l^ ^p=M,s=l 

'np=M,s=N’ Vp=Ivls=N 

Net Effect 

Vs=i{a) 

r]s=N{0L) 


Table 1: Tabular representation of sources of uncertainties that produce a correlated effect in the normalization 
individual samples (eg. OverallSys). The represent histogram when ag = 1 and are inserted into the High 
attribute of the OverallSys XML element. Similarly, the t ]~^ represent histogram when = — 1 and are inserted 
into the Low attribute of the OverallSys XML element. Note, this does not imply that 77 + > 77 “, the ± superscript 
correspond to the variation in the source of the systematic, not the resulting effect. 


Syst 

Sample 1 

Sample N 

Nominal Value 


^s=N,b 

p=HistoSys 1 

^p=l,s=l,b’ ^p=l^s=l^b 

%=l,s=7V,b’ ^p=l,s=N,b 

p=HistoSys M 

^p=M,s=l,b’ ^p=M,s=l,b 

^p=M,s=N,b’ ^p=M,s=N,b 

Net Effect 

^s=l,b{^) 

(^s=N,b{0-) 


Table 2: Tabular representation of sources of uncertainties that produce a correlated effect in the normalization 
and shape individual samples (eg. HistoSys ). The represent histogram when = 1 and are inserted into 
the HighHist attribute of the HistoSys XML element. Similarly, the represent histogram when = — 1 
and are inserted into the LowHist attribute of the HistoSys XML element. 


dures and they must be handled with care. Bellow are some of the interpolation strategies supported by 
HistFactory. These are all ’vertical’ style interpolation treated independently per-bin. Four interpola¬ 
tion strategies are described below and can be compared in Fig|^ The interested reader is invited to look 
at alternative ’horizontal’ interpolation strategies, such as the one developed by Alex Read in Ref. 1141 
(the RooFit implementation is called RooIntegralMorph) and Max Baak’s RooMomentMorph. These 
horizontal interpolation strategies are better suited for features moving, such as the location of an invari¬ 
ant mass bump changing with the hypothesized mass of a new particle.. 


Piecewise Linear (InterpCode=0) 

The piecewise-linear interpolation strategy is defined as 

risioL) = l+ /iin.(ap;l,r7+, r]Jp) 

pSSyst 

and for shape interpolation it is 


with 


asb{a) = a^,h+ Y 

pSSyst 




a(/+ - 7°) a > 0 
— I~) a < 0 


(23) 


(24) 


(25) 


Pros: This approach is the most straightforward of the interpolation strategies. 

Cons: It has two negative features. First, there is a kink (discontinuous first derivative) at a = 0 
(see Fig|^b-d)), which can cause some difficulties for numerical minimization packages such as Minuit. 
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Second, the interpolation factor can extrapolate to negative values. For instance, if rj = 0.5 then we 
have 77 (a) < 0 when a < —2 (see Figj^c)). 

Note that one could have considered the simultaneous variation of ap and a^/ in a multiplicative 
way (see for example. Fig ??). The multiplicative accumulation is not an option currently. 

Note that this is the default convention for asb{ct) (ie. HistoSys ). 


Piecewise Exponential (InterpCode=l) 

The piecewise exponential interpolation strategy is defined as 

Vs{oc)= n ^exp.iap'A,vtp, v7p) ( 26 ) 

pSSyst 


and for shape interpolation it is 


with 


— ^sb 2exp. 

pSSyst 


/exp.(a;/°,/+,/-) = 


{oip](T^sbi^psb'’ ^psb) 

{i^/hT « > 0 

a <0 


(27) 


(28) 


Pros: This approach ensures that 77 ( 0 ) > 0 (see Fig[^c)) and for small response to the uncer¬ 
tainties it has the same linear behavior near a ~ 0 as the piecewise linear interpolation (see Fig[^a)). 

Cons: It has two negative features. First, there is a kink (discontinuous first derivative) at a = 0, 
which can cause some difficulties for numerical minimization packages such as Minuit. Second, for 
large uncertainties it develops a different linear behavior compared to the piecewise linear interpolation. 
In particular, even if the systematic has a symmetric response (ie. 77+ — 1 = 1 — 77“) the interpolated 
response will develop a kink for large response to the uncertainties (see Fig|^c)). 

Note that the one could have considered the simultaneous variation of Up and Up/ in an additive 
way, but this is not an option currently. 

Note, that when paired with a Gaussian constraint on a this is equivalent to linear interpolation and 
a log-normal constraint in ln(a). This is the default strategy for normalization uncertainties 7 /^( 0 :) (ie. 
OverallSys ) and is the standard convention for normalization uncertainties in the LHC Higgs Com¬ 
bination Group. In the future, the default may change to the Polynomial Interpolation and Exponential 
Extrapolation described below. 


Polynomial Interpolation and Exponential Extrapolation (InterpCode=4) 

The strategy of this interpolation option is to use the piecewise exponential extrapolation as above 
with a polynomial interpolation that matches r]{a = ±ao), drj/da\a=±ao, and d‘^ri/da‘^\a=±ao and the 
boundary ±ao is defined by the user (with default ao = 1). 


with 


7poiy|gxp.(Q:p, 1,77gp, Tjgp^cxo) 

pSSyst 


7poly|exp. (®) d , I , I , ag) 


'(/+/7o)“ «>ao 

< 1 + aiO^ |a| < ao 
a <-ao 


and the a* are fixed by the boundary conditions described above. 


(29) 


(30) 
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Pros: This approach avoids the kink (discontinuous first and second derivatives) at a = 0 (see 
Fig[^b-d)), which can cause some difficulties for numerical minimization packages such as Minuit. 
This approach ensures that ri{a) > 0 (see Figj^c)). 

Note: This option is not available in ROOT 5.32.00, but is available for normalization uncertainties 
(OverallSys) in the subsequent patch releases. In future releases, this may become the default. 




(a) (b) 



(c) (d) 

Fig. 6: Comparison of the three interpolation options for different rf^. (a) rj~ = 0.8, ry'*' = 1.2, (b) 77 “ = 1.1, 
T]~^ — 1.5, (c) 77 “ = 0.2, Tj'^ = 1.8, and (d) ri~ — 0.95, rj'^ = 1.5 


4.1.6 Consistent Bayesian and Frequentist modeling 

The variational estimates 77 ^ and cj^ typically correspond to so called “Tier variations” in the source of 
the uncertainty. Here we are focusing on the source of the uncertainty, not its affect on rates and shapes. 
For instance, we might say that the jet energy scale has a 10% uncertainty. This is common jargon, 
but what does it mean? The most common interpretation of this statement is that the uncertain parameter 
ap (eg. the jet energy scale) has a Gaussian distribution. However, this way of thinking is manifestly 
Bayesian. If the parameter was estimated from an auxiliary measurement, then it is the PDF for that 

'^Without loss of generality, we choose to parametrize ap such that Op = 0 is the nominal value of this parameter, Op = ±1 
are the “ilcr variations”. 
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measurement that we wish to include into our probability model. In the frequentist way of thinking, the 
jet energy scale has an unknown true value and upon repeating the experiment many times the auxiliary 
measurements estimating the jet energy scale would fluctuate randomly about this true value. To aid in 
this subtle distinction, we use greek letters for the parameters (eg. Up) and roman letters for the auxiliary 
measurements Op. Furthermore, we interpret the “zhlcj” variation in the frequentist sense, which leads to 
the constraint term /p(ap|Q;p). Then, we can pair the resulting likelihood with some prior on Up to form 
a Bayesian posterior if we wish according to Eq.|^ 


It is often advocated that a “log-normal” or “gamma” distribution for Up is more appropriate 
than a gaussian constraint 1151. This is particularly clear in the case of bounded parameters and large 
uncertainties. Here we must take some care to build a probability model that can maintain a consistent 
interpretation in Bayesian a frequentist settings. Table summarizes a few consistent treatments of the 
frequentist pdf, the likelihood function, a prior, and the resulting posterior. 


PDF 

G{cLp\oip, dp) 

Pois(rep|rp/3p) 
Pln(^pI/3p) dp) 
-Pln(^pI/3p) ^p) 


Likelihood oc 

G{o!p\cLp, (Tp) 

Pr(/3p|^ = Tp]B = 1 + Up) 
f^p ■ Pln(/3pI^p) '^p) 

/5p ■ PLN(/3p|rip, dp) 


Prior TTo 
7ro(Q;p) oc const 
7ro(/3p) oc const 
7ro(/3p) oc const 
7ro(^p) oc l//3p 


Posterior vr 

G{o.p\ojp^ dp) 

Pr(/3p|^ = Tp; P = 1 + Up) 

Pln(/5pI^P) cTp) 

Pln(/3pI^P) <7p) 


Table 3: Table relating consistent treatments of PDF, likelihood, prior, and posterior for nuisance parameter con¬ 
straint terms. 


Finally, it is worth mentioning that the uncertainty on some parameters is not the result of an auxil¬ 
iary measurement - so the constraint term idealization, it is not just a convenience, but a real conceptual 
leap. This is particularly true for theoretical uncertainties from higher-order corrections or renormal- 
izaiton and factorization scale dependence. In these cases a formal frequentist analysis would not include 
a constraint term for these parameters, and the result would simply depend on their assumed values. As 
this is not the norm, we can think of reading Table from right-to-left with a subjective Bayesian prior 
7r(Q;) being interpreted as coming from a fictional auxiliary measurement. 


4.1.6.1 Gaussian Constraint 

The Gaussian constraint for ap corresponds to the familiar situation. It is a good approximation of 
the auxiliary measurement when the likelihood function for Op from that auxiliary measurement has a 
Gaussian shape. More formally, it is valid when the maximum likelihood estimate of ap (eg. the best fit 
value of Op) has a Gaussian distribution. Here we can identify the maximum likelihood estimate of Op 
with the global observable Op, remembering that it is a number that is extracted from the data and thus 
its distribution has a frequentist interpretation. 


G(^(ip\ap, (Tp) 



(31) 


with cTp = 1 by default. Note that the PDF of Op and the likelihood for Op are positive for all values. 


4.1.6.2 Poisson (“Gamma”) constraint 

When the auxiliary measurement is actually based on counting events in a control region (eg. a Poisson 
process), a more accurate to describe the auxiliary measurement with a Poisson distribution. It has been 
shown that the truncated Gaussian constraint can lead to undercoverage (overly optimistic) results, which 
makes this issue practically relevant Table shows that a Poisson PDF together with a uniform prior 
leads to a gamma posterior, thus this type of constraint is often called a “gamma” constraint. This is a 
bit unfortunate since the gamma distribution is manifestly Bayesian and with a different choice of prior. 
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one might not arrive at a gamma posterior. When dealing with the Poisson constraint, it is no longer 
convenient to work with our conventional scaling for Op which can be negative. Instead, it is more 
natural to think of the number of events measured in the auxiliary measurement Up and the mean of the 
Poisson parameter. This information is not usually available, instead one usually has some notion of the 
relative uncertainty in the parameter <7^®' (eg. a the jet energy scale is known to 10%). In order to give 
some uniformity to the different uncertainties of this type and think of relative uncertainty, the nominal 
rate is factored out into a constant Tp and the mean of the Poisson is given by TpUp. 


Pois(np|rpap) 


{Tpap)^p e 


n„'. 


(32) 


Here we can use the fact that Var [np] = y/TpUp and reverse engineer the nominal auxiliary measurement 


= (1/^?')^ ■ (33) 

where the superscript 0 is to remind us that np will fluctuate in repeated experiments but n® is the value 
of our measured estimate of the parameter. 

One important thing to keep in mind is that there is only one constraint term per nuisance pa¬ 
rameter, so there must be only one cjp®^ per nuisance parameter. This cip®^ is related to the fundamental 
uncertainty in the source and we cannot infer this from the various response terms or 

Another technical difficulty is that the Poisson distribution is discrete. So if one were to say the 
relative uncertainty was 30%, then we would find = 11.11..., which is not an integer. Rounding np 
to the nearest integer while maintaining Tp = (1/(7^®*)^ will bias the maximum likelihood estimate of ap 
away from 1. To avoid this, one can use the gamma distribution, which generalizes more continuously 
with 

PT{ap\A = Tp,B = np-l) = A{Aap)^e-^^p/T{B) . (34) 

This approach works fine for likelihood fits, Bayesian calculations, and frequentist techniques based on 
asymptotic approximations, but it does not offer a consistent treatment of the pdf for the global observable 
Up that is needed for techniques based on Monte Carlo sampling. 


4.1.6.3 Log-normal constraint 

From Eadie et ah, “The log-normal distribution represents a random variable whose logarithm follows a 
normal distribution. It provides a model for the error of a process involving many small multiplicative 
errors (from the Central Limit Theorem). It is also appropriate when the value of an observed variable is a 
random proportion of the previous observation.” | T5p6 1. This logic of multiplicative errors applies to the 
the measured value, not the parameter. Thus, it is natural to say that there is some auxiliary measurement 
(global observable) with a log-normal distribution. As in the gamma/Poisson case above, let us again say 
that the global observable is rip with a nominal value 


nj = T, = (l/<r“')= 


Then the conventional choice for the corresponding log-normal distribution is 




1 


1 


In K np 

while the likelihood function is (blue curve in Fig.|7ja)). 


exp 


ln(np/a p) 
2 (In Kp 


2n 


(35) 


(36) 


LiyOtp^ - 


In K nj 


exp 


ln(np/Q;p)^ 

2(ln Kp)^ 


(37) 


23 













(38) 


To get to the posterior for Up given Up we need an ur-prior r]{ap) 


Tr{ap) oc r]{ap) 


1 1 

-^= -exp 

V 27r In K np 


ln(np/ap)^ 

2(ln 


If r]{ap) is uniform, then the posterior looks like the red curve in Fig. |^b). However, when paired with 
an “ur-prior” ri{ap) oc 1/ap (green curve in Fig. ^b)), this results in a posterior distribution that is also 
of a log-normal form for Up (blue curve in Fig. [7 b)). 




Fig. 7: The lognormal constraint term; (left) the pdf for the global observable Op and (right) the likelihood function, 
the posterior based on a flat prior on Op, and the posterior based on a 1 /op prior. 


4.1.7 Incorporating Monte Carlo statistical uncertainty on the histogram templates 

The histogram based approach described above are based Monte Carlo simulations of full detector sim¬ 
ulation. These simulations are very computationally intensive and often the histograms are sparsely 
populated. In this case the histograms are not good descriptions of the underlying distribution, but are 
estimates of that distribution with some statistical uncertainty. Barlow and Beeston outlined a treatment 
of this situation in which each bin of each sample is given a nuisance parameter for the true rate, which 
is then fit using both the data measurement and the Monte Carlo estimate p7| . This approach would 
lead to several hundred nuisance parameters in the current analysis. Instead, the HistFactory employs 
a lighter weight version in which there is only one nuisance parameter per bin associated with the total 
Monte Carlo estimate and the total statistical uncertainty in that bin. If we focus on an individual bin 
with index b the contribution to the full statistical model is the factor 

Pois(nb|t';,(Q;) -h 7 feZ^^‘^(Q:)) Pois(mfe| 7 bTb) , (39) 

where nj, is the number of events observed in the bin, Vh{oL) is the number of events expected in the 
bin where Monte Carlo statistical uncertainties need not be included (either because the estimate is 
data driven or because the Monte Carlo sample is sufficiently large), z^^^(q) is the number of events 
estimated using Monte Carlo techniques where the statistical uncertainty needs to be taken into account. 
Both expectations include the dependence on the parameters a.. The factor is the nuisance parameter 
reflecting that the true rate may differ from the Monte Carlo estimate by some amount. If 

the total statistical uncertainty is 6h, then the relative statistical uncertainty is given by This 

corresponds to a total Monte Carlo sample in that bin of size mb = Treating the Monte Carlo 

estimate as an auxiliary measurement, we arrive at a Poisson constraint term Pois(mb| 7 brb), where mi, 
would fluctuate about 7b Tb if we generated a new Monte Carlo sample. Since we have scaled 7 to be a 
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factor about 1 , then we also have Tb = /dbf", however, Tb is treated as a fixed constant and does not 

fluctuate when generating ensembles of pseudo-experiments. 

It is worth noting that the conditional maximum likelihood estimate "fbioc) can be solved analyti¬ 
cally with a simple quadratic expression. 


J , , -B + VH2 - AAC 

(40) 


(41) 

iyb{a)T + Ub{oc)i'^^{a) - UbU^^ia) - mbi^^^{oc) 

(42) 

C = mbVb{a.) . 

(43) 


In a Bayesian technique with a flat prior on 76 , the posterior distribution is a gamma distribution. 
Similarly, the distribution of % will take on a skew distribution with an envelope similar to the gamma 
distribution, but with features reflecting the discrete values of mb- Because the maximum likelihood 
estimate of 7 ^ will also depend on Ub and a, the features from the discrete values of mb will be smeared. 
This effect will be more noticeable for large statistical uncertainties where Tb is small and the distribution 
of 76 will have several small peaks. For smaller statistical uncertainties where Tb is large the distribution 
of 76 will be approximately Gaussian. 


4.2 Data-Driven Narrative 


The strength of the simulation narrative lies in its direct logical link from the underlying theory to the 
modeling of the experimental observations. The weakness of the simulation narrative derives from the 
weaknesses in the simulation itself. Data-driven approaches are more motivated when they address 
specific deficiencies in the simulation. Before moving to a more abstract or general discussion of the 
data-driven narrative, let us first consider a few examples. 


The first example we have already considered in Sec. 2.2 in the context of the “on/off” problem. 
There we introduced an auxiliary measurement that counted ncR events in a control region to estimate 
the background ub in the signal region. In order to do this we needed to understand the ratio of the num¬ 
ber of events from the background process in the control and signal regions, r. This ratio r either comes 
from some reasonable assumption or simulation. For example, if one wanted to estimate the background 
due to jets faking muons j —)• /r for a search selecting , then one might use a sample of 

events as a control region. Here the motivation for using a data-driven approach is that modeling the 
processes that lead to j —)■ /r rely heavily on the tails of fragmentation functions and detector response, 
which one might reasonably have some skepticism. If one assumes that control region is expected to 
have negligible signal in it, that backgrounds that produce other than the jets faking muons, and 

that the rate for j —)■ /i” is the samej^as the rate for j —)• then one can assume r = 1. Thus, this 

background estimate is as trustworthy as the assumptions that went into it. In practice, several of these 
assumptions may be violated. Another approach is to use simulation of these background processes to 
estimate the ratio r; a hybrid of the data-driven and simulation narratives. 


Let us now consider the search for Ff —)• 77 shown in Fig[ 8 ] p^|T^ . The right plot of Fig shows 
the composition of the backgrounds in this search, including the continuum production of pp —)• 77 , 
the 7 -f-jets process with a jet faking a photon j —)• 7 , and the multi jet process with two jets faking 
photons. The continuum production of 77 has a theoretical uncertainty that is much larger than the sta¬ 
tistical fluctuations one would expect in the data. Similarly, the rate of jets faking photons is sensitive 
to fragmentation and the detector simulation. These uncertainties are large compared to the statistical 
fluctuations in the data itself. Thus we can use the distribution in Fig to measure the total background 


'*Given that the LHC collides pp and not pp, there is clearly a reason to worry if this assumption is valid. 
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rate. Of course, the signal would also be in this distribution, so one either needs to apply a mass win¬ 
dow around the signal and consider the region outside of the window as a sideband control sample or 
model the signal and background contributions to the distribution. In the case of the if —77 shown 
in Fig [8 T 9 | the modeling of the distribution signal and background distributions is not based on 

histograms from simulation, but instead a continuous function is used as an effective model. I will dis¬ 
cuss this effective modeling narrative below, but point out that here this is another example of a hybrid 
narrative. 



Fig. 8: Distribution of diphoton invariant mass distributions in the ATLAS if —77 search. The left plot shows a 
fit of a an effective model to the data and the right plot shows an estimate of the 77, 7-Hjet, and diet contributions. 


The final example to consider is an extension of the ‘on/off’ model, often referred to as the ‘ABCD’ 
method. Let us start with the ‘on/off’ model: Pois(nsR|i^5 -b vb) • Pois(ncR|Ti/s). As mentioned 
above, this requires that one estimate r either from simulation or through some assumptions. The ABCD 
method aims to estimate introduce two new control regions that can be used to measure r. To see this, 
let us imagine that the signal and control regions correspond to requiring some continuous variable x 
being greater than or less than some threshold value Xc- If we could introduce a second discriminating 
variable y such that the distribution for background factorizes /b(x, y) = fB{x)fB{y), then we have a 
handle to measure the factor r. Typically, one introduces a threshold yc so that the signal contribution 
is small below this thresholc^ Figure shows an example where Xc = yc = 5 . With this we these 
two thresholds we have four regions that we can schematically refer to as A, B, C, and D. In the case of 
simply counting events in these regions we can write the total expectation as 



— 1 • /r + T 1 

( 44 ) 

I'B 

= + tb^a 



= ecy + + TCI'A 



= eofi + + tbtcva 



where // is the signal rate in region A, e* is the ratio of the signal in the regions B, C, D with respect to the 
signal in region A, is the rate of background in each of the regions being estimated from simulation, 
Ui is the rate of the background being estimated with the data driven technique in the signal region, and r* 
are the ratios of the background rates in the regions B, C, and D with respect to the background in region 
A. The key is that we have used the factorization /s(x, y) = fB{x)fB{y) to write tb = tbtc- The right 
panel of Fig. shows a more complicated extension of the ABCD method from a recent ATLAS SUSY 
analysis | [20t . 

'®The relative sign of the cut is not important, but has been chosen for consistency with Figj^ 


26 












d 

An alternative parametrization, which can be more numerically stable is 



= 1- n + + rjcrjBVD 

(45) 

VB 

= es/x -F -|- TjBV£) 


T^C 

= ec/i + + VC^D 


l^D 

= e/5/x -F -F 1 • yn 
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Fig. 9 : An example of ABCD (from Alex Read) in the x — y plane of two observables x and y (left). A more 
complex example with several regions in the plane |^. 


4.3 Effective Model Narrative 

In the simulation narrative the model of discriminating variable distributions f{x\a) is derived from 
discrete samples of simulated events {xi,..., xtv}. We discussed above how one can use histograms or 
kernel estimation to approximate the underlying distribution and interpolation strategies to incorporate 
systematic effects. Another approach is to assume some parametric form for the distribution to serve as an 
effective model. For example, in the iF —)• 77 analysis shown in Fig.j^a simple exponential distribution 
was used to model the background. The state-of-the-art theoretical predictions for the continuum 77 
background process do not predict exactly an exponentially falling distribution, and the analysis must 
(and does) incorporate the systematic associated to the effective model. Similarly, it is common to use 
a polynomial in some limited sideband region to estimate backgrounds under a peak. These effective 
models can range from very ad hoc to more motivated. For instance, one might use knowledge of 
kinematics and phase space and/or detector resolution to construct an effective model that captures the 
relevant physics. The advantage of a well motivated effective model is that few nuisance parameters 
may describe well the relevant family of probability densities, which is the challenge for generic (and 
relatively unsophisticated) interpolation strategies usually employed in the simulation narrative. 

4.4 The Matrix Element Method 

Ideally, one would not use a single discriminating variable to distinguish the process of interest from 
the other background processes, but instead would use as much discriminating power as possible. This 
implies forming a probability model over a multi-dimensional discriminating variable (ie. a multivariate 

^'’For instance, the modeling of iF —>■ ZZ^*^ —>■ 4Z described in (see Eq. 2 of the corresponding section) 
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analysis technique). In principle, both the histogram-based and kernel-based approach generalize to 
distributions of multi-dimensional discriminating variables; however, in practice, they are limited to only 
a few dimensions. In the case of histograms this is particularly severe unless one employs clever binning 
choices, while in the kernel-based approach one can model up to about 5-dimensional distributions with 
reasonable Monte Carlo sample sizes. In practice, one often uses multivariate algorithms like Neural 
Networks or boosted decision tree^to map the multiple variables into a single discriminating variable. 
Often these multivariate techniques are seen as somewhat of a black-box. If we restrict ourselves to 
discriminating variables associated with the kinematics of final state particles (as opposed to the more 
detailed signature of particles in the detector), then we can often approximate he detailed simulation of 
the detector with a parametrized detector response. If we denote the kinematic configuration of all the 
final slate particles in Ihe Lorenlz invarianl phase space as <I>, Ihe initial stale as i, Ihe malrix elemenl 
(polenlially averaged over unmeasured spin configurations) as Ad (i, <I>), and the probability due to parton 
density functions for the initial state i going into the hard scattering as /(f), then we can write that the 
distribution of the, possibly multi-dimensional, discriminating variable x as 

f{x)(x jd<^>f{i)\M{i,<^>)\‘^W{x\<^>), (46) 

where IC(a:|<I>) is referred to as the transfer function of x given the final state configuration 4>. It is 
natural to think of W{x\^) as a conditional distribution, but here I let W encode the efficiency and 
acceptance so that we have 

^ ^ JdxJd^\M{i,n‘'W{x\^) 
a f d^ <h)p 

Otherwise, the equation above looks like another application one Bayes’s theorem where W (xj^h) plays 
the role of the pdf/likelihood function and Ad(i, <h) plays the role of the prior over the $. It is worth 
pointing out that this is a frequentist use of Bayes’s theorem since is the Lorentz invariant phase space 
which explicitly has a measure associated with it. 


4.5 Event-by-event resolution, conditional modeling, and Punzi factors 

In some cases one would like to provide a distribution for the discriminating variable x based conditional 
on some other observable in the event y. f{x\oL, y). For instance, one might want to say that the en¬ 
ergy resolution for electrons depends on the energy itself through a well-known calorimeter resolution 
parametrization like a{E)/E = Aj^fE © B. These types of conditional distributions can be built in 
RooFit. A subtle point studied by Punzi is that if f{y\a) depends on o: the inference on a can be bi¬ 
ased |231. In particular, if one is trying to estimate the amount of signal in a sample and the distribution 
of y for the signal is different than for the background, the estimate of the signal fraction will be biased. 
This can be remedied by including terms related to f{y\Q.), colloquially called ‘Punzi Factors’. Impor¬ 
tantly, this means one cannot build conditional models like this without knowing or assuming something 
about f{y\cx). 


5 Frequentist Statistical Procedures 

Here I summarize the procedure used by the LHC Higgs combination group for computing frequentist 
p-values uses for quantifying the agreement with the background-only hypothesis and for determining 
exclusion limits. The procedures are based on the profile likelihood ratio test statistic. 

The parameter of interest is the overall signal strength factor //, which acts as a scaling to the total 
rate of signal events. We often write p = ajasM, where asM is the standard model production cross- 
section; however, it should be clarified that the same y factor is used for all production modes and could 

A useful toolkit for high-energy physics is TMVA, which is packaged with ROOT j22|. 
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also be seen as a scaling on the branching ratios. The signal strength is called so that /r = 0 corresponds 
to the background-only model and /r = 1 is the standard model signal. It is convenient to separate the 
full list of parameters cx into the parameter of interest ^ and the nuisance parameters 6: a = (/r, 0). 

For a given data set Dsim and values for the global observables Q there is an associated likelihood 
function over ^ and 9 derived from combined model over all the channels including all the constraint 
terms in Eq.[^ 

L(/i, . (48) 

The notation T(/x, 0) leaves the dependence on the data implicit, which can lead to confusion. Thus, we 
will explicitly write the dependence on the data when the identity of the dataset is important and only 
suppress Psinu G when the statements about the likelihood are generic. 

We begin with the definition of the procedure in the abstract and then describe three implementa¬ 
tions of the method based on asymptotic distributions, ensemble tests (Toy Monte Carlo), and importance 
sampling. 


5,1 The test statistics and estimators of /r and 6 

This definitions in this section are all relative to a given dataset Psim and value of the global observables 
Q, thus we will suppress their appearance. The nomenclature follows from Ref. Q. 

The maximum likelihood estimates (MLEs) jl and 6 and the values of the parameters that max¬ 
imize the likelihood function 6) or, equivalently, minimize — lnL(/x, 6). The dependence of the 
likelihood function on the data propagates to the values of the MEEs, so when needed the MEEs will 
be given subscripts to indicate the data set used. Eor instance, 0obs is the MEE of 0 derived from the 
observed data and global observables. 

The conditional maximum likelihood estimate (CMEEs) 0(/i) is the value of 6 that maximizes 
the likelihood function with /r fixed; if can be seen as a mulfidimensional funcfion of fhe single variable 
/i. Again, fhe dependence on and G is implicif. This procedure for choosing specihc values of fhe 
nuisance paramefers for a given value of /x, 'Dsim, and Q is oflen referred fo as “profiling”. Similarly, 
6{n) is oflen called “fhe profiled value of 6”. 

Given fhese definifions, we can construcf fhe profile likelihood rafio 


\{^l) 




(49) 


which depends explicifly on fhe paramefer of inferesf /x, implicifly on fhe dafa Psim and global observ¬ 
ables G, and is independenf of fhe nuisance paramefers 9 (which have been eliminaled via “prohling”). 

In any physical fheory fhe rale of signal evenls is non-negafive, Ihus /x > 0. However, if is oflen 
convenienl lo allow /x < 0 (as long as fhe pdf /c(xc|/x, 0) > 0 everywhere). In parlicular, /x < 0 indicates 
a deficil of evenls signal-like wilh respecf lo fhe background only and fhe boundary al /x = 0 complicates 
fhe asymplofic dislribufions. Ref. Q uses a Irick fhal is equivalenl lo requiring /x > 0 while avoiding 
the formal complications of a boundary, which is to allow /x < 0 and impose the constraint in the test 
statistic itself. In particular, one defines A(/i) 


A(/x) 


L{M) 

< 


/x > 0, 
/x < 0 


(50) 


This is not necessary when ensembles of pseudo-experiments are generated with “Toy” Monte Carlo 
techniques, but since they are equivalent we will write A to emphasize the boundary at /x = 0. 


29 






(c) 


(d) 


Fig. 10: Visualization of a two dimensional likelihood function —2 In 6). The blue line in the plane represents 
the profiling operation 0(/i) and the blue curve along the likelihood surface represents —2 In A(/i). Note it is was 
to show that the blue line exits the contours of —21nL(/i, 9) when they are perpendicular to the /r axis, which 
provides the correspondence between the profile likelihood ratio and the description of the Minos algorithm. 


For discovery the test statistic % is used to differentiate the background-only hypothesis ^ = 0 
from the alternative hypothesis ^ > 0: 


% 


—21nA(^) /t > 0 

0 ft < 0 


(51) 


Note that % is test statistic for a one-sided alternative. Note also that if we consider the parameter of 
interest ^ > 0, then it is equivalent to the two-sided test (because there are no values of /i less than 

/i = 0. 

For limit setting the test statistic is used to differentiate the hypothesis of signal being produced 
at a rate // from the alternative hypothesis of signal events being produced at a lesser rate /r' < fi: 


9m 


—2 In A(^) fi < 

0 A > F 


mem 

m,e) 


A < 0, 

0 < A ^ F ) 

fl> n. 


(52) 
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Note that is a test statistie for a one-sided alternative; it is a test statistie for a one-sided upper limit. 

The test statistie is used to differentiate signal being produeed at a rate /x from the alternative 
hypothesis of signal events being produeed at a lesser or greater rate /x' / 

f^ = -21nA(^). (53) 

Note that is a test statistie for a two-sided alternative (as in the ease of the Feldman-Cousins teehnique, 
though this is more general as it ineorporates nuisanee parameters). Note that if we eonsider the parame¬ 
ter of interest /U > 0 and we the test at /x = 0 then there is no “other side” and we have = %. Finally, 
if one relaxes the eonstraint /x > 0 then the two-sided test statistie is written or, simply, —2 In A(/x). 

5.2 The distribution of the test statistic and p-values 

The test statistie should be interpreted as a single real-valued number that represents the outeome of the 
experiment. More formally, it is a mapping of the data to a single real-valued number: : "Dsim, ^ M. 

For the observed data the test statistie has a given value, eg. g^^obs- If one were to repeat the experiment 
many times the test statistie would take on different values, thus, eoneeptually, the test statistie has 
a distribution. Similarly, we ean use our model to generate pseudo-experiments using Monte Carlo 
teehniques or more abstraetly eonsider the distribution. Sinee the number of expeeted events v{ii, 6) and 
the distributions of the diseriminating variables /c(xc|/x, 0) explieitly depend on 6 the distribution of the 
test statistie will also depend on 6. Let us denote this distribution 

(54) 

and we have analogous expressions for eaeh of the test statistics described above. 

The p-value for a given observation under a particular hypothesis (p, 0) is the probability for an 
equally or more ‘extreme’ outcome than observed assuming that hypothesis 

poo 

P^i,e= f{qf,\fi,e)dqf, . (55) 

9/i,obs 

The logic is that small p-values are evidence against the corresponding hypothesis. In Toy Monte Carlo 
approaches, the integral above is really carried out in the space of the data f dD^imdG. 

The immediate difficulty is that we are interested in ji but the p-values depend on both p and 6. In 
the frequentist approach the hypothesis p = po would not be rejected unless the p-value is sufficiently 
small for all values of 6. Equivalently, one can use the supremum p-value for over all 6 to base the 
decision to accept or reject the hypothesis at p = po- 

p™P = sup (56) 

e 

The key conceptual reason for choosing the test statistics based on the profile likelihood rafio 
is fhaf asympfolically (ie. when fhere are many evenfs) fhe disfribufion of fhe profile likelihood rafio 
A(/x = /xtrue) is indcpcndcnf of fhe values of fhe nuisance paramefers. This follows from Wilks’s fheo- 
rem. In fhaf limif p™'’ = p^ ^ for all 6. 

The asympfofic disfribufions /(A(p)|p, 6) and /(A(p)|p', 0) are known and described in Sec. ??. 
For resulfs based on generating ensembles of pseudo-experiemenfs using Toy Monfe Carlo fechniques 
does nol assume fhe form of fhe disfribufion /(g^|p, 0), buf knowing fhaf if is approximafely independenf 
of 0 means fhaf one does nof need fo calculafe p-values for all 9 (which is nol compulalionally feasible). 
Since fhere may still be some residual dependence of fhe p-values on fhe choice of 9 we would like 
fo know fhe specific value of fhaf produces fhe supremum p-value over 9. Since larger p-values 

indicafe better agreemenl of fhe dala wilh fhe model, if is nol surprising fhaf choosing = 0(p) 
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is a good estimate of This has been studied in detail by statisticians, and is called the Hybrid 

Resampling method and is referred to in physics as the ‘profile construction’ [?,|^ 11 2^. 

Based on the discussion above, the following p-value is used to quantify consistency with the 
hypothesis of a signal strength of 


Pm = 


/ 

J a. 


/(q/,|/i,0(Ai,obs))% . 


(57) 


Q'/j,,obs 


A standard 95% confidence-level, one-sided frequenfisl confidence inferval (upper limif) is obfained by 
solving for = 5%. For downward flucluafions fhe upper limif of fhe confidence inferval can be 
arbifrarily small, fhough if will always include p = 0. This fealure is considered undesirable since a 
physicisf would nol claim sensifivify fo an arbifrarily small signal rafe. The fealure was fhe motivation 
for fhe modified frequenfisl mefhod called CLg 12^27|. and fhe alfemalive approach called power- 
conslrained limits 


(58) 


To calculate the CLg upper limit, we define p' as a ralio of p-values. 


Pu = 


Pm 


1 ’ 
1 -P6 


where ph is fhe p-value derived from fhe same lesl slalislic under fhe background-only hypolhesis 


Pfe = 1 - 


/(9m|0>^(p = 0,obs))dq^ 


(59) 


' Qfj,,obs 


The CLg upper-limif on p is denofed p^p and obfained by solving for p'^^^ = 5%. If is worth noting 
lhal while confidence intervals produced wilh fhe “CLs” mefhod over cover, a value of p is regarded 
as excluded al fhe 95% confidence level if p < pup- The amounf of over coverage is nol immediafely 
obvious; however, for small values of p fhe coverage approaches 100% and for large values of p fhe 
coverage is near fhe nominal 95% (due fo (pb) « 0). 

For fhe purposes discovery one is inleresled in compalibilily of fhe dafa wilh fhe background-only 
hypolhesis. Sfalisfically, a discovery corresponds fo rejecling fhe background-only hypolhesis. This 
compalibilily is based on fhe following p-value 


poo 

Po = / /(qo|O,0(p = 0,obs))dqo • 

Q0,ohs 


(60) 


This p-value is also based on fhe background-only hypolhesis, buf fhe lesl slalislic % is suiled for lesling 


fhe background-only while fhe lesl slalislic in Eq. 59 is suited for testing a hypolhesis wilh signal. 

If is customary to convert fhe background-only p-value into fhe quantile (or “sigma”) of a unil 
Gaussian. This conversion is purely conventional and makes no assumption lhal Ihe lesl slalislic qq is 
Gaussian dislribuled. The conversion is defined as: 


Z = ch-^(l-po); 


(61) 


where ^ is fhe inverse of fhe cumulative dislribulion for a unil Gaussian. One says Ihe significance of 
Ihe resull is Za and Ihe slandard discovery convention is 5 ct, corresponding to po = 2.87 • 10“^. 


5.3 Expected sensitivity and bands 

The expected sensitivity for limits and discovery are useful quantities, though subject to some degree 
of ambiguity. Intuitively, the expected upper limit is the upper limit one would expect to obtain if 
the background-only hypothesis is true. Similarly, the expected significance is the significance of the 
observation assuming the standard model signal rate (at some m//). To find the expected limit one 
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needs a distribution = 0,0). To find the expected significance one needs fhe disfribufion 

f{Z\fi = 1, 6) or, equivalenfly, f{po\^J, = 1, 0). We use fhe median insfead of fhe mean, as if is invarianf 
fo fhe choice of Z or pQ. More imporfanfly, is fhaf fhe expecfed limif and significance depend on fhe 
value of fhe nuisance paramefers 6, for which we do nof know fhe frue values. Thus, fhe expecfed limit 
and significance will depend on some convenfion for choosing 6. While many nuisance parameters have 
a nominal esfimafe (i.e. fhe global observables in fhe consfrainf terms), ofhers do nof (eg. fhe exponenf in 
fhe if —)• 77 background model). Thus, we choose a convenfion fhaf freafs all of fhe nuisance parameters 
consisfenfly, which is fhe profiled value based on fhe observed dafa. Thus for fhe expecfed limif we use 

f{pup\0, d{p = 0,obs)) and for fhe expecfed significance we use f{po\p = = l,obs)). An 

uninfuifive and possibly undesirable fealure of fhis choice is fhaf fhe expecfed limif and significance 
depend on fhe observed dafa fhrough fhe convenfional choice for 6. 

Wifh fhese disfribufions we can also define bands around fhe median upper limif. Our sfandard 
limit plot shows a dark green band corresponding to fi±i defined by 



f{pnp\0, 0{p = 0, obs))(i/rup 


= $"^(± 1 ) 


and a lighf yellow band corresponding fo p ±2 defined by 



/(^up|0, 0(/i = 0, obs))fi^up 


= $“^(± 2 ) 


(62) 


(63) 


5.4 Ensemble of pseudo-experiments generated with “Toy” Monte Carlo 

The p-values in the procedure described above require performing several integrals. In the case of the 
asymptotic approach, the distributions for and qo are known and the integral is performed directly. 
When the distributions are not assumed to take on their asymptotic form, then they must be constructed 
using Monte Carlo methods. In the “toy Monte Carlo” approach one generates pseudo-experiments in 
which the number of events in each channel Uc, the values of the discriminating variables {xec} for each 
of those events, and the auxiliary measurements (global observables) are all randomized according to 
ftot- We denote the resulting data Ptoy and global observables Gtoy By doing this several times one can 
build an ensemble of pseudo-experiments and evaluate the necessary integrals. Recall that Monte Carlo 
techniques can be viewed as a form of numerical integration. 

The fact that the auxiliary measurements ap are randomized is unfamiliar in particle physics. The 
more familiar approach for toy Monte Carlo is that the nuisance parameters are randomized. This re¬ 
quires a distribution for the nuisance parameters, and thus corresponds to a Bayesian treatment of the 
nuisance parameters. The resulting p-values are a hybrid Bayesian-Frequentist quantity with no consis¬ 
tent definition of probability. To maintain a strictly frequentist procedure, the corresponding operation is 
to randomize the auxiliary measurements. 

While formally this procedure is well motivated, as physicists we also know that our models can 
have deficiencies and we should check that the distribution of the auxiliary measurements does not devi¬ 
ate too far from our expectations. In Section ?? we show the distribution of the auxiliary measurements 
and the corresponding 6 from the toy Monte Carlo technique. 

Technically, the pseudo-experiments are generated with the RooStats ToyMCSampler, which is 
used by the higher-level tool FrequentistCalculator, which is in turn used by HypoTestInverter. 


5.5 Asymptotic Formulas 

The following has been extracted from Ref. Q and has been reproduced here for convenience. The 
primary message of Ref. Q is that for a sufficiently large data sample the distributions of the likelihood 
ratio based test statistics above converge to a specific form. In particular, Wilks’s theorem p9| can be 
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used to obtain the distribution /(A(^)|/i), that is the distribution of the test statistic A(/x) when fj, is true. 
Note that the asymptotic distribution is independent of the value of the nuisance parameters. Wald’s 
theorem 130| provides the generalization to /(A(^)|^', 6), that is when the true value is not the same as 
the tested value. The various formulae listed below are corollaries of Wilks’s and Wald’s theorems for 
the likelihood ratio test statistics described above. The Asimov data described immediately below was a 
novel result of Ref. |[T|. 


5.5.1 The Asimov data and a = var(fi) 

The asymptotic formulae below require knowing the variance of the maximum likelihood estimate of fj, 

a = var[/i] . (64) 

One result of Ref. o is that a can be estimated with an artificial dataset referred to as the Asimov dataset. 
The Asimov dataset is defined as a binned dataset, where the number of events in bin b is exactly the 
number of events expected in bin b. Note, this means that the dataset generally has non-integer number 
of events in each bin. For our general model one can write 


nb,A = / i'{a)f{x\oc)dx (65) 

J irGbin h 

where the subscript A denotes that this is the Asimov data. Note, that the dataset depends on the value of 
OL implicitly. For an model of unbinned data, one can simply take the limit of narrow bin widths for the 
Asimov data. We denote the likelihood evaluated with the Asimov data as Lp^{p). The important result 
is that one can calculate the expected Fisher information of Eq. |^by computing the observed Fisher 
information on the likelihood function based on this special Asimov dataset. 

A related and convenient way to calculate the variance of fi is 


(7 ~ 



( 66 ) 


where is the to use the test statistic based on a background-only Asimov data (ie. 
with/i = 0 in Eq. 651. It is worth noting that higher-order corrections to the formulae below 
developed to address the case when the variance of fi depends strongly on 


the one 
are being 


5.5.2 Asymptotic Formulas for qo 

Eor a sufficiently large data sample, the pdf f{qo\p.') is found to approach 



(67) 


Eor the special case of p! = 0, this reduces to 


/(9o|0) 


2'5(7o) + 


1 1 1 
2 ^/2ti 


e 


iJo/2 


( 68 ) 


That is, one finds a mixture of a delta function at zero and a chi-square distribution for one degree of 
freedom, with each term having a weight of 1/2. In the following we will refer to this mixture as a half 
chi-square distribution or Ixi- 


Erom Eq. (67 1 the corresponding cumulative distribution is found to be 


F{qo\i^') 




(69) 
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The important special case /u' = 0 is therefore simply 

F{qo\0) = ■ 

The p-value of the p = 0 hypothesis is 

po = 1 - F{qo\0) , 

and therefore for the significance gives the simple formula 

Z = ^>“1(1 -po) = y/% . 


(70) 


(71) 


(72) 


5.5.3 Asymptotic Formulas for 

For a sufficiently large data sample, the pdf f{qij\p) is found to approach 


fW) = ^ 


F - F 
a 




+ 


1 _L 

2 


= exp 




/\2 


exp 


^ yJl'KG 

The special case p = p' is therefore 


1 




0 < 5 ;, < fjp/cr'^ 
q^ > p^/cr^ 


_ \ _ \ _g 

2 


^ <q^l< 


, \/2'kc 


exp 


' 2 (2m/^)" 


% > 


The corresponding cumulative distribution is 




F{%y) = < 

The special case p = p' is 

F{q^l\p) = < 


2/7,/(T 


0 < qfj.< 

% > • 


$ 


2 /i/(T 


(73) 


(74) 


(75) 


0 < q/j. < p^/cr^ , 

% > ■ 

The p-value of the hypothesized p is as before given by one minus the cumulative distribution, 


(76) 


F/. = 1 -• (77) 

As when using the upper limit on p at confidence level 1 — a is found by selling = a and 
solving for p, which reduces lo fhe same resulf as found when using q^, namely. 

Pup = P + 0-4>“^(l - a) . (78) 

Nofe fhaf because a depends in general on p, Eq. (|7^ musl be solved numerically. 
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5.5.4 Expected CLg Limit and Bands 

For the CLg method we need distributions for for the hypothesis at jj. and /i = 0. We find 


, 1 - 
P/. = 


The median and expected error bands will therefore be 

piup+N = - Q$(iV)) + N) 

with 


2 F 
a = 


Q/i,A 


(79) 


(80) 


(81) 


a = 0.05, p, can be taken as in the calculation of a. Note that for iV = 0 we find the median limit 


,med 


= 1(1-0.5a) 


(82) 


The fact that a (the variance of p) defined in Eq. 66 in general depends on p complicates situations 
and can lead to some discrepancies between the correct value of the bands and those obtained with the 
equation above. The bands tend to be too narrow. A modified treatment of the bands taking into account 
the p dependence of a is under development. 


5.6 Importance Sampling 

[The following section has been adapted from text written primarily by Sven Kreiss, Alex Read, and 
myself for the ATLAS Higgs combination. It is reproduced here for convenience. ] 

To claim a discovery, it is necessary to populate a small tail of a test statistic distribution. Toy 
Monte-Carlo techniques use the model ftot to generate toy data Vtoy. For every pseudo-experiment 
(toy), the test statistic is calculated and added to the test statistic distribution. Building this distribution 
from toys is independent of the assumptions that go into the asymptotic calculation that describes this 
distribution with an analytic expression. Recently progress has been made using Importance Sampling to 
populate the extreme tails of the test statistic distribution, which is much more computationally intensive 
with standard methods. The presented algorithms are implemented in RooStats ToyMCSampler. 


5.6.1 Naive Importance Sampling 

An ensemble of "standard toys" is generated from a model representing the Null hypothesis with p = 0 
and the nuisance parameters 6 fixed at their profiled values fo the observed data Sobs^ written 
ftot(77sim, G\p = 0, 0obs)- With importance sampling however, the underlying idea is to generate toys 
from a different model, called the importance density. A valid importance density is for example the 
same model with a non-zero value of p. The simple Likelihood ratio is calculated for each toy and used 
as a weight. 

. , , ("tot(27toy ^GtoylP — Oj^obs) 

r \ - I a —7 

Itotftv^Qy, ytoyl/^ — P j t/obs) 


The weighted distribution is equal to a distribution of unweighted toys generated from the Null. 
The choice of the importance density is a delicate issue. Michael Woodroofe presented a prescription for 
creating a well behaved importance density 1311. Unfortunately, this method is impractical for models as 
large as the combined Higgs models. An alternative approach is shown below. 
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5.6.2 Phase Space Slicing 

The first improvement from naive importance sampling is the idea of taking toys from both, the null 
density and the importance density. There are various ways to do that. Simply stitching two test statistic 
distributions together at an arbitrary point has the disadvantage that the normalizations of both distribu¬ 
tions have to be known. 

Instead, it is possible to select toys according to their weights. First, toys are generated from the 
Null and the simple Likelihood ratio is calculated. If it is larger than one, the toy is kept and otherwise 
rejected. Next, toys from the importance density are generated. Here again, the simple Likelihood ratio 
is calculated but this time the toy is rejected when the Likelihood ratio is larger than one and kept when 
it is smaller than one. If kept, the toy’s weight is the simple Likelihood ratio which is smaller than one 
by this prescription. 

In the following section, this idea is restated such that it generalizes to multiple importance densi¬ 
ties. 


5.6.3 Multiple Importance Densities 

The above procedure for selecting and reweighting toys that were generated from both densities can be 
phrased in the following way: 

- A toy is generated from a density with p, = p.' and the Likelihoods ftot(^^toy, Qtoy\p = 0, 0obs) 
and ftot(Aoy, Qtoy\p = p', Oohs) are calculated. 

- The toy is veto-ed when the Likelihood with p = p' is not the largest. Otherwise, the toy is used 
with a weight that is the ratio of the Likelihoods. 

This can be generalized to any number of densities with pi = {0, p', p ",..For the toys generated 
from model i: 


veto, if ftot(^toy) ^toy If — pii^ohs) 7^ maX-j^ f^ot (^toy j ^toy | F — Pji^ohs) ■ Pj — {Oj F jF )■••}} 


weight 


ftot(^toy) ^toylF — 0) ^obs) 
ftot(^toy, G toy If — F*) ^obs) 


(83) 

(84) 


The number of importance densities has to be known when applying the vetos. It should not be 
too small to cover the parameter space appropriately and it should not be too large, because too many 
importance densities lead to too many vetoed toys which decreases overall efficiency. The value and 
error of p from a fit to data can be used to estimate the required number of importance densities for a 
given target overlap of the distributions. 


The sampling efficiency in the tail can be further improved by generating a larger number of toys 
for densities with larger values of p. For example, for n densities, one can generate 2}^ jT^ = 
of the overall toys per density k with A: = 0,..., n — 1. The toys have to be re-weighted for example 
by resulting in a minimum re-weight factor of one. The current implementation of the error 

calculation for the p-value is independent of an overall scale in the weights. 


The method using multiple importance densities is similar to Michael Woodroofe’s |311 prescrip¬ 
tion of creating a suitable importance density with an integral over p. In the method presented here, the 
integral is approximated by a sum over discrete values of p. Instead of taking the sum, a mechanism that 
allows for multiple importance densities is introduced. 


5.7 Look-elsewhere effect, trials factor, Bonferoni 

Future versions of this document will discuss the so-called look-elsewhere effect in more detail. Here 


we point to the primary development recently: [32 331. 
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Fig. 11: An example sampling of a test statistic distribution using three densities, the original null density and two 
importance densities. 


5.8 One-sided intervals, CLs, power-constraints, and Negatively Biased Relevant Subsets 


Particle physicists regularly set upper-limits on cross sections and other parameters that are bounded to 
be non-negative. Standard frequentist confidence intervals should nominally cover at the stated value. 
The implication that a 95% confidence level upper-limif covers fhe frue value 95% of fhe lime is fhaf if 
doesnT cover fhe frue value 5% of fhe lime. This is frue no mailer how small fhe cross section is. Thai 
means fhaf if Ihere is no signal presenl, 5% of fhe time we would be excluding any positive value of 
fhe cross-seclion. Experimenlalisls do nol like Ibis since we would nol consider ourselves sensilive lo 
arbilrarily small signals. 


Two main approaches have been proposed lo prolecl from excluding signals lo which we do nol 


consider ourselves sensitive. The firsl is Ihe CLs procedure inlroduced by Read and described above 125 
^7\. The CLs procedure produce intervals lhal over-cover 


meaning lhal Ihe intervals cover Ihe frue 
value more lhan Ihe desired level. The coverage for small values of Ihe cross-section approaches 100%, 
while for large values of Ihe cross section, where Ihe experimenl does have sensilivily, Ihe coverage 
converges lo Ihe nominal level (see Lig. [T2] ). Unfortunately, Ihe coverage for intermediate values is 
nol immediately accessible wilhoul more delailed sludies. Interestingly, Ihe modified frequenlisl CLs 
procedure reproduces Ihe one-sided upper limil from a Bayesian procedure wilh a uniform prior on Ihe 
cross section for simple models like number counting analyses. Even in very complicated models we see 
very good numerical agreemenl belween CLs and Ihe Bayesian approach, even Ihough Ihe inlerprelalion 
of Ihe numbers is different 


An alternate approach called power-conslrained limils (PCL) is lo leave Ihe slandard frequenlisl 
procedure unchanged while adding an additional requiremenl for a parameter poinl lo be considered 
‘excluded’. The additional requiremenl is direclly a measure of Ihe sensitivity of lo lhal parameter poinl 
based on Ihe notion of power (or Type II error). This approach makes Ihe coverage of Ihe procedure 
manifesl | [28l . 


Surprisingly, one-sided upper limils on a bounded parameter are a sublie topic lhal has led to 
debates among Ihe experls of slalislics in Ihe collaborations and a siring of interesting articles from 
slalislicians. The discussion is beyond Ihe scope of Ihe currenl version of Ihese notes, bul Ihe interested 
reader is invited and encouraged to read 1341 and Ihe responses from nolable slalislicians on Ihe topic. 
More recenlly Cousins tried to formalize the sensitivity problem in terms of a concept called Negatively 
Biased Relevant Subsets (NBRS) 1351. While the power-constrained limits do not formally emit NBRS, 
it is an interesting insight. Even more recently, Vitells has found interesting connections with CLs and 
the work of Bimbaum [TtH^ . This connection is significant since statisticians have primarily seen CLs 


as an ad hoc procedure mixing the notion of size and power with no satisfying properties. 
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Fig. 12: Taken from Fig.3 of |[28j: (a) Upper limits from the PCL (solid), CLs and Bayesian (dashed), and classical 
(dotted) procedures as a function of /i), which is assumed to follow a Gaussian distribution with unit standard 
deviation, (b) The corresponding coverage probabilities as a function of p. 


6 Bayesian Procedures 

[This section is far from complete. Some key practical issues and references to other literature are given.] 

Unsurprisingly, Bayesian procedures are based on Bayes’s theorem as in Eq. [^and Eq. The 
Bayesian approach requires one to provide a prior over the parameters, which can be seen either as 
an advantage or a disadvantage p7| 38 1. In practical terms, one typically wants to build the posterior 
distribution for the parameter of interest. This typically requires integrating, or marginalizing, over all 
the nuisance parameters as in Eq. These integrals can be over very high dimensional posteriors 
with complicated structure. One of the most powerful algorithms for this integration is Markov Chain 
Monte Carlo, described below. In terms of the prior one can either embrace the subjective Bayesian 
approach |39| or take a more ’objective’ approach in which the prior is derived from formal rules. Eor 
instance, Jeffreys’s Prior 


or their generalization in terms of Reference Priors 1411. 

Given the logical importance of the choice of prior, it is generally recommended to try a few 
options to see how the result numerically depends on the choice of priors (i.e.. sensitivity analysis). This 
leads me to a few great quotes from prominent statisticians: 

“Sensitivity analysis is at the heart of scientific Bayesianism” -Michael Goldstein 

“Perhaps the most important general lesson is that the facile use of what appear to be uninformative 
priors is a dangerous practice in high dimensions” -Brad Efron 

“Meaningful prior specification of beliefs in probabilistic form over very large possibility spaces 
is very difficult and may lead to a lot of arbitrariness in the specification” - Michael Goldstein 

“Objective Bayesian analysis is the best frequentist tool around” -Jim Berger 


6.1 Hybrid Bayesian-Frequentist methods 

It is worth mentioning that in particle physics there has been widespread use of a hybrid Bayesian- 
Erequentist approach in which one marginalizes nuisance parameters. Perhaps the most well known 
example is due to a paper by Cousins and Highland p^. In some instances one obtains a Bayesian- 
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averaged model that depends only on the parameters of interest 


f(2?|apoi) 


ftot (^|Q^)'^(^nuis) dc^miis 


(85) 


and then proceeds with the typical frequentist methodology for calculating p-values and constructing 
confidence intervals. Note, in this approach the constraint terms that are appended to fsim of Eq. |^to 
obtain ftot of Eq.|^are interpreted as in Eq. [^and ?y(Q:nuts) is usually a uniform prior. Eurthermore, the 
global observables or auxiliary measurements Op are typically left fixed to their nominal or observed val¬ 
ues and not randomized. In other variants the full model without constraints fsim(P|Q;) is used to define 
the test statistic but the distribution of the test statistic is obtained by marginalizing (or randomizing) the 
nuisance parameters as in Eq.[^ See the following references for more details ||4]43|-49). 


The shortcomings of this approach are that the coverage is not guaranteed and the method uses an 
inconsistent notion of probability. Thus it is hard to define exactly what the p-values and intervals mean 
in a formal sense. 


6.2 Markov Chain Monte Carlo and the Metropolis-Hastings Algorithm 


The Metropolis-Hastings algorithm is used to construct a Markov chain {oti}, where the samples rvj 
are proportional to the target posterior density or likelihood function. The algorithm requires a proposal 
function (5 (q:|q:') that gives the probability density to propose the point a given that the last point in 
the chain is a.'. Note, the density only depends on the last step in the chain, thus it is considered a 
Markov process. At each step in the algorithm, a new point in parameter space is proposed and possibly 
appended to the chain based on its likelihood relative to the current point in the chain. Even when 
the proposal density function is not symmetric. Metropolis Hastings maintains ‘detailed balance’ when 
constructing the Markov chain by counterbalancing the relative likelihood between the two points with 
the relative proposal density. That is, given the current point a, proposed point ex', likelihood function 
L, and proposal density function Q, we visit ex' if and only if 


L{(x') Q{(x\cx') 
L{<x) Q{(x'\ex) 


> Rand[0, 1] 


( 86 ) 


Note, if the proposal density is symmetric, Q{cx\(x') = Q{(x'\(x), then the ratio of the proposal densities 
can be neglected (which can be computationally expensive). Above we have written the algorithm to 
sample the likelihood function L{(x), but typically one would use the posterior 7r(Q;). Within RooStats 
the Metropolis-Hastings algorithm is implemented with the MetropolisHastings class, which returns 
a MarkovChain. Another powerful tool is the Bayesian Analysis Toolkit (BAT) |50|. Note, one can use 
a RooFit / RooStats model in the BAT environment. 


Note, an alternative to Markov Chain Monte Carlo is the nested sampling approach of Skilling | [5T[ 
and the MultiNest implementation |52|. 


Easily, we mention that sampling algorithms associated to Bayesian belief networks and graphical 
models may offer enormous advantages to both MCMC and nested sampling due to the fact that they can 
take advantage of the conditional dependencies in the model. 


6.3 Jeffreys’s and Reference Prior 


One of the great advances in Bayesian methodology was the introduction of Jeffreys’s rule for selecting 
a prior based on a formal rule p0| . The rule selects a prior that is invariant under reparametrization of 
the observables and covariant with reparametrization of the parameters. The rule is based on information 
theoretic arguments and the prior is given by the square root of the determinant of the Eisher information 
matrix, which we first encountered in Eq.|^ 


7r(Q:) = Y^det T.^^,{cx) = yt 


'r -52 log ftot a) 

ftot(^|Q:) -^- dV 

OOipOlpi 


(87) 


40 











While the right-most form of the prior looks daunting with complex integrals over partial derivatives, 
the Asimov data described in Sec. |5.5.1 and Ref. |T| provide a convenient way to calculate the Fisher 
information. Fig. [T^ and [T4] show examples of RooStats numerical algorithm for calculating Jeffreys’s 
prior compared to analytic results on a simple Gaussian and a Poisson model. 


RooWorkspace w{"w"); 

w.factory{"Gaussian;:g(x[0,-20,20],mu[0,-5,5],sigma[1,0,10])"); 
w.factory{"n[10,.1,200]"); 
w.factory{"ExtendPdf::p(g,n)"); 
w.var{"n")->setConstant(); 

w.var{"sigma")->setConstant(); 
w.defineSet("poi","mu"); 
w.defineSet("obs","x"); 

RooJeffreysPrior pi("jeffreys","jeffreys",*w.pdf("p"),*w.set("poi"),*w.set("obs")); 



Fig. 13: Example code making a Gaussian distribution (with 10 events expected) and the Jeffreys Prior for /r and 
(j calculated numerically in RooStats and compared to the analytic result. 


RooWorkspace w("w"); 
w. factory("Uniform::u(x[0,1])' 
w.factory("mu[ 10 0,1,2 0 0]"); 
w.factory("ExtendPdf::p(u,mu)' 

w. defineSet("poi","mu"); 
w.defineSet("obs","x"); 

// w.defineSet("obs2","n"); 



RooJeffreysPrior pi("jeffreys" , "jeffreys" , *w.pdf("p"),*w.set("poi") , *w.set("obs")); 


Fig. 14: Example code making a Poisson distribution (with 100 replications expected) and the Jeffreys Prior for p 
calculated numerically in RooStats and compared to the analytic result. 


Unfortunately, Jeffreys’s prior does not behave well in multidimensional problems. Based on a 
similar information theoretic approach, Bernardo and Berger have developed the Reference priors |53- 
|56) and the associated Reference analysis. While attractive in many ways, the approach is fairly difficult 
to implement. Recently, there has been some progress within the particle physics context in deriving the 
reference prior for problems relevant to particle physics ||4T][57|. 


6.4 Likelihood Principle 

For those interested in the deeper and more philosophical aspects of statistical inference, the likelihood 
principle is incredibly interesting. This section will be expanded in the future, but for now I simply 
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suggest searching on the internet, the Wikipedia article, and Ref. 1361. In short the principle says that 
all inference should be based on the likelihood function of the observed data. Frequentist procedures 
violate the likelihood principle since p-values are tail probabilities associated to hypothetical outcomes 
(not the observed data). Generally, Bayesian procedures and those based on the asymptotic properties of 
likelihood tests do obey the likelihood principle. Somewhat ironically, the objective Bayesian procedures 
such as Reference priors and Jeffreys’s prior can violate the likelihood principle since the prior is based 
on expectations over hypothetical outcomes. 


7 Unfolding 


Another topic for the future. The basic aim of unfolding is to try to correct distributions back to the true 
underlying distribution before detector ’smearing’. For now, see |58 -65|. 


8 Conclusions 

It was a pleasure to lecture at the 2011 ESHEP school in Cheile Gradistei and the 2013 CEASHEP school 
in Peru. Quite a bit of progress has been made in the last few years in terms of statistical methodology, in 
particular the formalization of a fully frequentist approach to incorporating systematics, a deeper under¬ 
standing of the look-elsewhere effect, the development of asymptotic approximations of the distributions 
important for particle physics, and in roads to Bayesian reference analysis. Eurthermore, most of these 
developments are general purpose and can be applied across diverse models. While those developments 
are interesting, the most important area for most physicists to devote their attention in terms of statistics 
is to improve the modeling of the data for his or her individual analysis. 
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